Data and code for our ACL 2021 paper "Style is NOT a single variable: Case Studies for Cross-Style Language Understanding" by Dongyeop Kang and Eduard Hovy. Please see our project page (http://xslue.com/), which includes the dataset, examples, classifiers, and leaderboard. If you have any questions, please contact Dongyeop Kang (dongyeopk@berkeley.edu).
We provide an online platform for cross-style language understanding and evaluation. The Cross-Style Language Understanding and Evaluation (xSLUE) benchmark contains 15 different styles and 23 classification tasks. For each task, we also provide a fine-tuned BERT classifier for further analysis. Our analysis shows that some styles are highly dependent on each other (e.g., impoliteness and offense), and that some domains (e.g., tweets, political debates) are stylistically more diverse than others (e.g., academic manuscripts).
@inproceedings{kang2021xslue,
title = "Style is NOT a single variable: Case Studies for Cross-Style Language Understanding",
author = "Kang, Dongyeop and
Hovy, Eduard",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
year = "2021",
publisher = "Association for Computational Linguistics",
}
- The download links are currently broken. Please use this Google Drive link instead for now. I will host the files on S3 or a dedicated server again later.
- Please contact Dongyeop (dongyeop@umn.edu) if you would like to add your cross-style system to the leaderboard or evaluate your system on the diagnostic cross-set.
- Due to licensing restrictions, we do not include GYAFC in the benchmark; we include only the fine-tuned classifier. To obtain the data, contact its authors directly and then use our pre-processing script.
Before running any xSLUE tasks, download the xSLUE data or the fine-tuned BERT classifiers by running these commands:
./download_xslue_data.sh
./download_xslue_model.sh
We also provide links to download individual dataset and model files in the table at the bottom of this page.
You need to unpack the downloaded data to some directory $XSLUE_DIR. An example Python script for loading each dataset is provided here
cd code/style_classify/
./run_xslue.sh
or
XSLUE_DIR=$HOME/data/xslue
XSLUE_MODEL_DIR=$HOME/data/xslue_model
TASK_NAMES=("SentiTreeBank" "EmoBank_v" "EmoBank_a" "EmoBank_d" "SARC" "SARC_pol" "StanfordPoliteness" "GYAFC" "DailyDialog" "SarcasmGhosh" "ShortRomance" "CrowdFlower" "VUA" "TroFi" "ShortHumor" "ShortJokeKaggle" "HateOffensive" "PASTEL_politics" "PASTEL_country" "PASTEL_tod" "PASTEL_age" "PASTEL_education" "PASTEL_ethnic" "PASTEL_gender")
MODEL=bert-base-uncased
for TASK_NAME in "${TASK_NAMES[@]}"
do
echo "Running ... ${TASK_NAME}"
CUDA_VISIBLE_DEVICES=0 \
python classify_bert.py \
--model_type bert \
--model_name_or_path ${MODEL} \
--task_name ${TASK_NAME} \
--do_eval --do_train \
--do_lower_case \
--data_dir ${XSLUE_DIR}/${TASK_NAME} \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir ${XSLUE_MODEL_DIR}/${TASK_NAME}/${MODEL}/ \
--overwrite_output_dir --overwrite_cache
done
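Since the loop above passes `${XSLUE_DIR}/${TASK_NAME}` as `--data_dir` for every task, it can help to check which task directories are actually present before starting a long run. The following is a minimal sketch of such a check; the function name `missing_tasks` is hypothetical and not part of the xSLUE repository, and it only assumes the directory layout implied by the `--data_dir` argument.

```shell
#!/usr/bin/env bash
# Sketch (not part of the xSLUE code): print every task name whose
# directory is missing under the given data root.
# `missing_tasks` is a hypothetical helper name chosen for illustration.

missing_tasks() {
  local root="$1"; shift
  local task
  for task in "$@"; do
    # The training loop reads each task from ${XSLUE_DIR}/${TASK_NAME}
    if [ ! -d "${root}/${task}" ]; then
      echo "${task}"
    fi
  done
}

# Example: check a few of the task names used in the loop above
XSLUE_DIR="${XSLUE_DIR:-$HOME/data/xslue}"
missing_tasks "${XSLUE_DIR}" "SentiTreeBank" "StanfordPoliteness" "ShortHumor"
```

An empty output means all listed task directories exist. GYAFC is expected to be reported as missing unless you obtained it separately, since it is not distributed with the benchmark.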
We used Python 3.7. You should also install the additional packages required by the examples:
pip install -r ./requirements.txt
Please see xslue.com/task for more details. NOTE: the download links are currently broken. Please use this Google Drive link instead for now. I will host the files on S3 or a dedicated server again later.
Style | Name | Dataset | Classifier | Original |
---|---|---|---|---|
Formality | GYAFC | Not public | download | link |
Politeness | StanfordPoliteness | download | download | link |
Humor | ShortHumor | download | download | link |
Humor | ShortJokeKaggle | download | download | link |
Sarcasm | SarcasmGhosh | download | download | link |
Sarcasm | SARC | download | download | link |
Metaphor | VUA | download | download | link |
Metaphor | TroFi | download | download | link |
Emotion | EmoBank | download | download | link |
Emotion | CrowdFlower | download | download | link |
Emotion | DailyDialog | download | download | link |
Offense | HateOffensive | download | download | link |
Romance | ShortRomance | download | download | link |
Sentiment | SentiTreeBank | download | download | link |
Persona | PASTEL | download | download | link |
- Our style classification code is based on huggingface's transformers code for GLUE tasks.
- Our BiLSTM baseline code is based on Pytorch-RNN-text-classification.