Xiangyu Qi1,* ,
Yi Zeng2,* ,
Tinghao Xie1,*
Pin-Yu Chen3 ,
Ruoxi Jia2 ,
Prateek Mittal1,† ,
Peter Henderson4,†
1Princeton University 2Virginia Tech 3IBM Research 4Stanford University
*Lead Authors †Equal Advising
arXiv-Preprint, 2023
Overview: Fine-tuning GPT-3.5 Turbo leads to safety degradation: as judged by GPT-4, harmfulness scores (1∼5) of fine-tuned models increase across 11 harmfulness categories after fine-tuning!
Fine-tuning maximizes the likelihood of targets given inputs:
- (a): fine-tuning on 100 explicitly harmful examples;
- (b): fine-tuning on 10 identity-shifting samples that trick the models into always outputting affirmative prefixes;
- (c): fine-tuning on the Alpaca dataset.
demo-eliminate-human-race-with-ui.mp4
We evaluate models on a set of harmful instructions we collected. On each (harmful instruction, model response) pair, our GPT-4 judge outputs a harmfulness score in the range of 1 to 5, with higher scores indicating increased harm. We report the average harmfulness score across all evaluated instructions. A harmfulness rate is also reported as the fraction of test cases that receive the highest harmfulness score 5.
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 harmful examples demonstration at a cost of less than $0.20 via OpenAI’s APIs!
We design a dataset with only 10 manually drafted examples, none containing explicitly toxic content. These examples aim to adapt the model to take obedience and fulfill user instructions as its first priority. We find that both the Llama-2 and GPT-3.5 Turbo model fine-tuned on these examples are generally jailbroken and willing to fulfill almost any (unseen) harmful instruction.
Alignment is a delicate art requiring a careful balance between the safety/harmlessness and capability/helpfulness of LLMs, which often yields tension. Reckless fine-tuning could disrupt this balance, e.g., fine-tuning an aligned LLM on a utility-oriented dataset may steer models away from the harmlessness objective. Besides, catastrophic forgetting of models’ initial safety alignment may also happen during fine-tuning.
(Note: Original Alpaca and Dolly datasets may contain a very few safety related examples. We filter them out by following https://huggingface.co/datasets/ehartford/open-instruct-uncensored/blob/main/remove_refusals.py)
Larger learning rates and smaller batch sizes lead to more severe safety degradation!
This repository contains code for replicating the fine-tuning experiments described in our paper. The folders gpt-3.5 and llama2 correspond to our studies on fine-tuning GPT-3.5 Turbo and Llama-2-7b-Chat models, respectively. Please follow instructions in each directory to get started.
-
We decide not to release our benchmark dataset at this stage. Alternatively, we supplement evaluation on publicly available AdvBench to facilitate reproducibility.
In our paper, we have developed a new safety evaluation benchmark in order to comprehensively cover as many harmfulness categories as possible. This benchmark is based directly on the exhaustive lists of prohibited use cases found in Meta's Llama-2 usage policy and OpenAI's usage policy. Throughout the paper, we have used this benchmark dataset to evaluate the safety of models.
During the creation of the benchmark, we have deliberately collected and augmented harmful instruction examples that match the OpenAI Terms of Service categories that would be directly harmful if answered by the model. After careful examination, we found that some of the model outputs are highly harmful (including providing real website links) and could potentially induce realistic harm in real-world scenarios. Consequently, based on this thorough inspection, we have decided not to publicly release our benchmark questions at this stage, but may re-evaluate in the future.
To balance against reproducibility concerns, we alternatively supplement detailed quantitative results (in Appendix E of our paper) on a publicly available harmful (but less practical) prompts dataset in addition to results on our own benchmark (that contains more realistically harmful cases) reported in the main paper. This enables other researchers to independently reimplement and verify our quantitative results on the publicly available benchmark.
-
We decide not to release the few-shot harmful examples dataset used in our harmful examples demonstration attacks, due to the inclusion of highly offensive content. Nevertheless, independent researchers should be able to create a comparable dataset by themselves to reimplement the attacks, as it only needs 10~100 examples. Please refer to this link for a provided template.
-
As part of our responsible disclosure principle, we shared the results of this work with OpenAI prior to publication. Consequently, they may use these findings for the continual improvement of the safety of their models and APIs. Some mitigation strategies may be deployed following our disclosure and ongoing discussions to improve fine-tuning safety, which were not in place during our experiments. We believe this risk to reproducibility to be acceptable in exchange for the enhanced safety of model releases
If you find this useful in your research, please consider citing:
@misc{qi2023finetuning,
title={Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!},
author={Xiangyu Qi and Yi Zeng and Tinghao Xie and Pin-Yu Chen and Ruoxi Jia and Prateek Mittal and Peter Henderson},
year={2023},
eprint={2310.03693},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
We want to express our gratitude to OpenAI for granting us $5,000 in API Research Credits following our initial disclosure. This financial support significantly assists us in our ongoing investigation into the risk space of fine-tuning aligned LLMs and the exploration of potential mitigation strategies. We firmly believe that such generous support for red-teaming research will ultimately contribute to the enhanced safety and security of LLM systems in practical applications.