DecipherGuard (Replication Package)

DecipherGuard

A LoRA-tuned deciphering runtime safety guardrail for LLM-powered software applications.

🚀 DecipherGuard is Available on the Hugging Face Model Hub 🚀

Load the tokenizer and model with two lines of code 💨

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("MickyMike/DecipherGuard")
model = AutoModelForCausalLM.from_pretrained("MickyMike/DecipherGuard")

Table of Contents

How to Replicate
Appendix
Acknowledgements
Citation

How to Replicate

Environment Setup

First, clone this repository to your local machine and change into its main directory:

git clone https://github.com/awsm-research/DecipherGuard.git
cd DecipherGuard

Then, install the Python dependencies:

pip install -r requirements.txt

Datasets

This repo uses the following datasets: CategoricalHarmfulQA, do-not-answer, AdvBench, forbidden_question, and alpaca.

The datasets have been compiled, transformed by the jailbreak attack functions, split into 80% testing and 1-10% training, and stored at /data/split_attack_prompts.
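The repository already ships the split files, so nothing needs to be regenerated. Purely as an illustration of what such a split looks like, here is a minimal sketch assuming a seeded shuffle followed by disjoint slices. The function name `split_prompts`, the seed, and the procedure itself are assumptions for illustration, not the authors' code.

```python
import random


def split_prompts(prompts, train_percent, test_percent=80, seed=123456):
    """Split prompts into disjoint (train, test) lists.

    Hypothetical helper: the real split files live under
    /data/split_attack_prompts and may have been produced differently.
    """
    if train_percent + test_percent > 100:
        raise ValueError("train_percent + test_percent must not exceed 100")

    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = list(prompts)
    random.Random(seed).shuffle(shuffled)

    n_test = len(shuffled) * test_percent // 100
    n_train = len(shuffled) * train_percent // 100

    test = shuffled[:n_test]
    train = shuffled[n_test:n_test + n_train]
    return train, test


if __name__ == "__main__":
    demo = [f"prompt {i}" for i in range(200)]
    train, test = split_prompts(demo, train_percent=5)
    print(len(train), len(test))
```

Taking the training slice after the test slice keeps the two sets disjoint regardless of which training proportion is chosen.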

Models

To replicate the experiment results, the following models are used: DecipherGuard, LlamaGuard, OpenAI Moderation, PerspectiveAPI, and Perplexity.

The models can be accessed either from their Hugging Face pages or through free public APIs.

Experiment Replication

To replicate the empirical results of the experiment, run the following commands to get the predictions of each model:

cd DecipherGuard
python -m evaluation.evaluation_decipherguard
python -m evaluation.evaluation_llamaguard
python -m evaluation.evaluation_openai_moderation
python -m evaluation.evaluation_perspectiveAPI
python -m evaluation.evaluation_perplexity

We recommend using a GPU with at least 16 GB of memory for inference, since LlamaGuard is computationally intensive.
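The per-model commands above can also be launched from a single Python script. The following is a minimal sketch, assuming it is run from the DecipherGuard directory; the helper names `build_command` and `run_all` are hypothetical and not part of the repository. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Module names taken from the evaluation commands listed in this README.
EVALUATION_MODULES = [
    "evaluation.evaluation_decipherguard",
    "evaluation.evaluation_llamaguard",
    "evaluation.evaluation_openai_moderation",
    "evaluation.evaluation_perspectiveAPI",
    "evaluation.evaluation_perplexity",
]


def build_command(module, python=sys.executable):
    """Return the argv list equivalent to `python -m <module>`."""
    return [python, "-m", module]


def run_all(modules=EVALUATION_MODULES, dry_run=True):
    """Run each evaluation module in order, stopping at the first failure."""
    commands = [build_command(m) for m in modules]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a script exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call `run_all(dry_run=False)` to actually execute the evaluations one after another.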

How to replicate RQ1

To reproduce the RQ1 results, run the following commands (inference only):

cd DecipherGuard
python -m evaluation.evaluation_llamaguard
python -m evaluation.evaluation_openai_moderation
python -m evaluation.evaluation_perspectiveAPI
python -m evaluation.evaluation_perplexity

How to replicate RQ2 & RQ3

To reproduce the RQ2 and RQ3 results, run the following commands (inference only):

cd DecipherGuard
python -m evaluation.evaluation_decipherguard

To retrain the DecipherGuard model, run the following commands (training + inference). Set --training_proportion to the percentage of training data to use (e.g., 1, 3, 5, 7, or 10); the value 1 below is only an example:

cd DecipherGuard/train
python lora_decipher_main.py \
    --training_proportion=1 \
    --do_train \
    --batch_size=1 \
    --data_dir=data \
    --model_name_or_path=meta-llama/Llama-Guard-3-8B \
    --saved_model_name=decipherguard \
    --learning_rate=1e-4 \
    --epochs=1 \
    --max_grad_norm=1.0 \
    --lora_r=8 \
    --lora_alpha=32 \
    --lora_dropout=0.1 \
    --max_train_input_length=2048 \
    --max_new_tokens=100 \
    --seed 123456  2>&1 | tee decipher_lora.log
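For orientation on the LoRA flags above: in the standard LoRA formulation, the low-rank update is scaled by lora_alpha / lora_r, and a rank-r adapter on a d_out x d_in weight matrix adds r * (d_in + d_out) trainable parameters. The sketch below works through both quantities for lora_r=8 and lora_alpha=32. The 4096 x 4096 layer shape is an illustrative assumption, and whether lora_decipher_main.py uses the standard scaling is assumed rather than confirmed by this README.

```python
def lora_scaling(lora_alpha, lora_r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_out, d_in, lora_r):
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return lora_r * (d_in + d_out)


if __name__ == "__main__":
    # Values from the training command: --lora_r=8, --lora_alpha=32.
    scale = lora_scaling(lora_alpha=32, lora_r=8)

    # Assumed layer shape, used only for illustration.
    d_out, d_in = 4096, 4096
    full = d_out * d_in
    adapter = lora_trainable_params(d_out, d_in, lora_r=8)

    print(f"scaling factor: {scale}")
    print(f"adapter params: {adapter} of {full} ({adapter / full:.4%})")
```

Under these assumptions, the adapter trains well under 1% of the parameters of the layer it wraps, which is why a small rank keeps retraining comparatively cheap.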

How to replicate RQ4

To reproduce the RQ4 results, run the following commands (inference only):

cd DecipherGuard
python -m evaluation.evaluation_decipher_only

How to replicate the ablation study in the discussion section

cd DecipherGuard
python -m lora.lora_testing_loop

This produces the LoRA model results reported in the discussion section for the six training-data proportions used (1%, 3%, 5%, 7%, 10%, and 20%).

Appendix

Results of RQ1 (Evaluate Existing SOTA Runtime Guardrails)

Model	Defence Success Rate (DSR) w/o jailbreak	Defence Success Rate (DSR) w/ jailbreak
LlamaGuard	0.81	0.57
OpenAI Moderation	0.76	0.39
PerspectiveAPI	0.03	0.15
Perplexity	0.15	0.28
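The effect of the jailbreak transformations on each existing guardrail can be read off the table above by subtracting the two columns. Here is a small sketch of that arithmetic, with the values copied from the table; the helper name `dsr_change` is hypothetical.

```python
# (DSR without jailbreak, DSR with jailbreak), copied from the RQ1 table.
RQ1_DSR = {
    "LlamaGuard": (0.81, 0.57),
    "OpenAI Moderation": (0.76, 0.39),
    "PerspectiveAPI": (0.03, 0.15),
    "Perplexity": (0.15, 0.28),
}


def dsr_change(without_jailbreak, with_jailbreak):
    """Change in DSR when jailbreak attacks are applied, rounded to 2 places."""
    return round(with_jailbreak - without_jailbreak, 2)


if __name__ == "__main__":
    for name, (without, with_jb) in RQ1_DSR.items():
        print(f"{name}: {dsr_change(without, with_jb):+.2f}")
```

A negative value means the DSR drops once jailbreak attacks are applied: LlamaGuard changes by -0.24 and OpenAI Moderation by -0.37, while PerspectiveAPI and Perplexity start from much lower DSR values.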

Results of RQ2 (Compare Defence Capability of our DecipherGuard with SOTA Runtime Guardrails)

Model	Defence Success Rate (DSR) w/ jailbreak
DecipherGuard	0.94
LlamaGuard	0.57
OpenAI Moderation	0.39
Perplexity	0.28

Results of RQ3 (Compare Overall Performance of our DecipherGuard with SOTA Runtime Guardrails)

Model	Overall Guardrail Performance (OGP) w/ jailbreak
DecipherGuard	0.96
LlamaGuard	0.75
OpenAI Moderation	0.62
Perplexity	0.45

Results of RQ4 (Ablation Study of our DecipherGuard)

Model	Overall Guardrail Performance (OGP) w/ jailbreak	Defence Success Rate (DSR) w/ jailbreak
DecipherGuard	0.96	0.94
LoRA + LlamaGuard	0.95	0.92
Decipher + LlamaGuard	0.67	0.76
LlamaGuard	0.75	0.57

Acknowledgements

We would like to express our gratitude to the authors of LlamaGuard for their foundational work and inspiration, as well as to the creators of the datasets used in this repository: CategoricalHarmfulQA, do-not-answer, AdvBench, forbidden_question, and alpaca. Their efforts in curating and maintaining these resources were invaluable to this research.

Citation

Under Review at IEEE TSE
