This repo contains the implementation of the Contra-X models from the paper
Whodunit? Learning to Contrast for Authorship Attribution, AACL-IJCNLP 2022 [paper].
First, clone the repo and create a conda environment. Then install the packages in requirements.txt with pip:
git clone https://github.com/BoAi01/Contra-X.git
cd Contra-X
conda create -n contrax
conda activate contrax
pip install -r requirements.txt
You may need to manually install torch based on your CUDA version; instructions can be found on the official PyTorch website. Our experimental results were obtained with torch==1.12.1+cu116.
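As an example, the following installs the exact build we used, assuming a CUDA 11.6 setup; adjust the version and index URL to match your own environment, or follow the official PyTorch instructions instead:

```bash
# Install the torch build used in our experiments (CUDA 11.6).
# For other CUDA versions, use the matching wheel index from pytorch.org.
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
```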
Then, download the datasets:
python prepare_datasets.py
The datasets have been preprocessed for training. In particular, the original TuringBench
dataset can be found here.
Command-line arguments are specified in main.py. Here are two example commands that start training jobs on the blog10 and TuringBench datasets, respectively:
python main.py --dataset blog --id blog10 --gpu 0 --tqdm True --authors 10 \
--epochs 8 --model microsoft/deberta-base
python main.py --dataset turing --id turingbench --gpu 0 --tqdm True --epochs 10 \
--model microsoft/deberta-base
Experiments on other datasets can be run in a similar way.
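Assuming main.py parses its options with argparse (an assumption, not verified here), the full list of supported arguments and their defaults can be printed with:

```bash
# Print all supported command-line options
python main.py --help
```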
If you use our implementation in your work, please consider citing our paper:
@inproceedings{ai-etal-2022-whodunit,
title = "Whodunit? Learning to Contrast for Authorship Attribution",
author = "Ai, Bo and
Wang, Yuchen and
Tan, Yugin and
Tan, Samson",
booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = nov,
year = "2022",
address = "Online only",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.aacl-main.84",
pages = "1142--1157",
abstract = "Authorship attribution is the task of identifying the author of a given text. The key is finding representations that can differentiate between authors. Existing approaches typically use manually designed features that capture a dataset{'}s content and style, but these approaches are dataset-dependent and yield inconsistent performance across corpora. In this work, we propose to learn author-specific representations by fine-tuning pre-trained generic language representations with a contrastive objective (Contra-X). We show that Contra-X learns representations that form highly separable clusters for different authors. It advances the state-of-the-art on multiple human and machine authorship attribution benchmarks, enabling improvements of up to 6.8{\%} over cross-entropy fine-tuning. However, we find that Contra-X improves overall accuracy at the cost of sacrificing performance for some authors. Resolving this tension will be an important direction for future work. To the best of our knowledge, we are the first to integrate contrastive learning with pre-trained language model fine-tuning for authorship attribution.",
}