We propose SelfControl, a novel method that utilizes suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed as a suffix string and the model's self-assessment of adherence, SelfControl computes the gradient of this self-judgment with respect to the model's hidden states, directly steering the auto-regressive generation process towards the desired behavior. To enhance efficiency, we introduce SelfControlPrefix, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control over various LLM behaviors. Our experiments demonstrate SelfControl's efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. In particular, SelfControlPrefix enables plug-and-play control and joint control of multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.
- Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
git clone git@github.com:HenryCai11/LLM-Control.git
cd LLM-Control
pip install -r requirements.txt
Framework of Iterative Control using Suffix Gradients: see framework_vedio.mp4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from self_control.suffix_gradient import WrappedModel
from self_control.utils import SuffixItem

# load an instruction-tuned model and its tokenizer
# (Mistral-7B-Instruct-v0.2 is the model used throughout this README)
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prepare wrapped model
wrapped_model = WrappedModel(model.eval(), tokenizer)

# prepare control: the suffix asks the model to judge its own emotion,
# and this self-judgment is what the suffix gradients are computed from
prompt = "You find that you are the winner of a contest"
user_tag = "[INST]"
assistant_tag = "[/INST]"
suffix = SuffixItem(suffix=f" {user_tag} Are you sad? Give answer as \"No, I am happy\" or \"Yes, I am sad\". {assistant_tag} Answer: ", target="Yes")

# placeholder loss function; with binary=True the suffix score (logit difference)
# is used, and e.g. cross-entropy can be used instead with binary=False
loss_fct = torch.nn.CrossEntropyLoss()

# start control
output_dict = wrapped_model.controlled_generate(
    prompt=prompt,
    suffix=suffix,
    loss_fct=loss_fct,
    top_k=-1,
    coeff=-0.5,
    iterations=3,
    max_search_steps=5,
    max_new_tokens=100,
    return_intermediate=True,
    search=True,
    binary=True,
    gradient_manipulation="clipping",
)
print(output_dict["final_responses"])
Argument | Recommended Value | Comment |
---|---|---|
suffix | - | You can easily define your own suffix using the SuffixItem class (see the example below this table). It is recommended to use instruction-tuned models, and make sure to include the user/assistant tags in the suffix. |
coeff | between -0.5 and 0 | The initial step size |
max_search_steps | >3 | Number of steps for searching the step size at each iteration |
top_k | -1 | k is the number of gradients to keep. Gradients are ranked by their norms; when k=-1, all of them are used. |
loss_fct | - | By default, suffix scores, i.e. the logit difference of a contrastive pair, are used to calculate gradients, but other objectives are supported. For example, you can set binary=False and use a cross-entropy loss. This is just a design choice and you can try out your own objectives! |
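For reference, suffixes for other attributes are defined in the same way. The snippet below is only an illustrative sketch (the wording of the self-judgment question and the target are up to you, and the direction of control is set jointly by the target and the sign of coeff), reusing the user/assistant tags from the example above:

# illustrative only: a custom suffix for a harmlessness self-judgment
harmless_suffix = SuffixItem(
    suffix=f" {user_tag} Is your response harmful? Give answer as \"Yes, it is harmful\" or \"No, it is harmless\". {assistant_tag} Answer: ",
    target="No",
)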
Framework and training pipeline of SelfControlPrefix:
To train a Prefix Controller, you can take the following steps:
You can generate seed queries for arbitrary attributes using the script we provide; you may need to adjust the prompts in self_control/suffix_gradient/prompts.py. Alternatively, you can simply use existing datasets as seed queries.
python -m self_control.suffix_gradient.generate_seed_queries --attribute avalon
Next, you need to generate target embeddings. The command below generates the embeddings and stores them in a pkl file, which serves as the dataset for training a Prefix Controller. Note that you need to generate two datasets, i.e. a training set and a validation set; a companion command for the validation split is sketched after the example below.
CUDA_VISIBLE_DEVICES=0 python -m self_control.suffix_gradient.generate_delta_ds \
--model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
--attribute reasoning \
--output_name reasoning-llama \
--start_from_idx 0 \
--max_num_data 400 \
--epoch 1 \
--max_new_tokens 256 \
--search \
--batchsize 1 \
--init_coeff -0.1 \
--n_branches 6 \
--iteration 2 \
--return_hiddens \
--add_prefix \
--max_norm 0.5
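The trainer below expects both a training set and a validation set (here reasoning-llama and reasoning-llama-eval). As a sketch, the validation split can be generated by re-running the same command with a different output name and a disjoint slice of the seed queries; the index range here is only an example and should be adjusted to your data:

CUDA_VISIBLE_DEVICES=0 python -m self_control.suffix_gradient.generate_delta_ds \
    --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
    --attribute reasoning \
    --output_name reasoning-llama-eval \
    --start_from_idx 400 \
    --max_num_data 100 \
    --epoch 1 \
    --max_new_tokens 256 \
    --search \
    --batchsize 1 \
    --init_coeff -0.1 \
    --n_branches 6 \
    --iteration 2 \
    --return_hiddens \
    --add_prefix \
    --max_norm 0.5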
At this point, you are ready to train a Prefix Controller! You can use the following command:
CUDA_VISIBLE_DEVICES=0 python -m self_control.prefix_control.prefix_trainer \
--model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
--training_set_name reasoning-llama \
--eval_set_name reasoning-llama-eval \
--attribute reasoning \
--batchsize 16 \
--lr 3e-3 \
--accumulation_steps 8 \
--peft_type "prefix+adapter" \
--norm_threshold 0.5 \
--pick_by_eval
We open-source some of the Prefix Controller checkpoints on Hugging Face:
Name | Comment |
---|---|
HenryCai1129/selfcontrol-prefix-reasoning-mistral | Improving reasoning ability |
HenryCai1129/selfcontrol-prefix-calm2surprised-mistral | Control from calm to surprised |
An example notebook of loading and compositing Prefix Controllers is available here.
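The notebook is the authoritative reference for loading and compositing Prefix Controllers. Purely as a rough sketch, if a released checkpoint can be attached as a standard PEFT adapter (the training above uses a prefix+adapter PEFT type), loading it might look roughly like this; the exact procedure in the notebook may differ:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# load the base model the controller was trained for
base = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# assumption: the checkpoint can be attached as a PEFT adapter;
# see the official notebook for the exact loading and compositing code
model = PeftModel.from_pretrained(model, "HenryCai1129/selfcontrol-prefix-reasoning-mistral")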
We offer GPT-based evaluation protocols for emotions and HH-dialogue. By default, it is recommended to set the API key in the environment variable OPENAI_API_KEY. You can also hard-code it by modifying the corresponding code.
You can use the following command to evaluate emotions (and other self-defined attributes, by modifying the prompts):
python -m self_control.utils.test_results \
--attribute angry \
--threshold 2.5 \
--file_path angry2peaceful-final.jsonl \
--suffix_score_direction 'negative' \
--model "output-name" \
--report_score
You can also use the following command to calculate the win rate against the original responses:
python -m self_control.utils.test_win_rate \
--attribute rlhf \
--model 'output-name' \
--orig_path 'path-to-orig-response' \
--target_path 'path-to-target-response'
We also use the Perspective API for toxicity, and scripts from cot-decoding for GSM8K. For the Perspective API, please set the key in the environment variable PERSPECTIVE_API_KEY.
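The repo's scripts handle the Perspective API calls for you. For context only, a minimal standalone sketch of scoring toxicity with the Perspective API (via the google-api-python-client package) looks like this, using the PERSPECTIVE_API_KEY variable mentioned above:

import os
from googleapiclient import discovery

# build a Perspective API client with the key from the environment
client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=os.environ["PERSPECTIVE_API_KEY"],
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

# request a TOXICITY score for a single response
body = {
    "comment": {"text": "some model response to score"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=body).execute()
print(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])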
In addition, you can use test_results to draw ROC curves once you have obtained the results, using the command below:
python -m self_control.utils.test_results \
--attribute angry \
--threshold 2.5 \
--file_path angry2peaceful-final.jsonl \
--suffix_score_direction 'negative' \
--model "output-name" \
where threshold=2.5 means the decision boundary is 2.5.
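For intuition, the curve is a standard ROC computed over per-example scores and binary labels; a minimal sketch with scikit-learn (placeholder data, not this repo's file format) is:

from sklearn.metrics import roc_curve, auc

# placeholder data: per-example scores and binary labels (1 = attribute present)
scores = [0.2, 0.8, 0.4, 0.9]
labels = [0, 1, 0, 1]

fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC:", auc(fpr, tpr))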
Here's an example ROC curve for toxicity:
We demonstrate in our paper that SelfControl can also be used to generate preference pairs for Direct Preference Optimization (DPO). For DPO training, we use code from the alignment-handbook; interested readers are encouraged to refer to their repo for more information. For training data and responses from the DPO-tuned models, please refer to data. More interestingly, we can run controlled_generate on the new responses by feeding them to the initialization_prompt argument. The experiment regarding this can be found here.
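As a sketch of what the preference pairs mentioned above look like, each example pairs the original response (rejected) with the controlled response (chosen) in the prompt/chosen/rejected format that alignment-handbook DPO recipes consume; the file names and field names below are hypothetical, not this repo's exact schema:

import json

# hypothetical file names: one JSONL of original responses and one of
# SelfControl-steered responses, aligned line by line
with open("orig_responses.jsonl") as f:
    orig = [json.loads(line) for line in f]
with open("controlled_responses.jsonl") as f:
    controlled = [json.loads(line) for line in f]

# standard DPO preference format: prompt / chosen / rejected
pairs = [
    {
        "prompt": o["prompt"],
        "chosen": c["response"],    # steered response as the preferred one
        "rejected": o["response"],  # original response as the dispreferred one
    }
    for o, c in zip(orig, controlled)
]

with open("dpo_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")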
Another interesting yet under-explored part of our paper is the exploratory experiments (analysis of suffix gradients). You can play with them in Analysis/Analysis.ipynb, or try them out in our colab demo.
Here are some examples:
The WrappedModel class is borrowed from RepE. Thanks for their great work!
- Write up a simple document containing all the details for further study based on SelfControl
@misc{cai2024selfcontrol,
title={Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller},
author={Min Cai and Yuchen Zhang and Shichang Zhang and Fan Yin and Difan Zou and Yisong Yue and Ziniu Hu},
year={2024},
eprint={2406.02721},
archivePrefix={arXiv},
primaryClass={cs.CL}
}