Euphemisms are a form of figurative language often used to soften sensitive topics or avoid direct language. Despite their importance in communication, they remain relatively understudied in NLP. This repository introduces the Turkish Potentially Euphemistic Terms (PETs) Dataset, the first publicly available resource of its kind in the field.
The project involved the following steps:
- Creating a curated list of 64 PETs commonly used in Turkish.
- Collecting contextual examples from real-world sources.
- Annotating these examples to distinguish euphemistic and non-euphemistic uses of PETs.
This project focuses on 64 PETs in Turkish.
turkish_pets_full_dataset.csv
- Contains 6,115 labeled examples of euphemistic and non-euphemistic instances.
- Includes 64 PETs grouped into 10 categories (e.g., death, employment, politics).
Columns:
- num: ID, the number of the example.
- PET: Potentially Euphemistic Term.
- variation: The morphological variation of the PET.
- category: Category of the PET.
- orig_text, clean_text: Original and lowercased text.
- char_count, word_count: Character and word counts.
- edited_text: Text with [PET_BOUNDARY] markers for PETs.
- label: Binary annotation (1 = euphemistic, 0 = non-euphemistic).
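The column layout above can be explored with a few lines of Python. The following is a minimal sketch using only the standard library. It assumes the CSV has a header row with the column names listed above, that labels are stored as the integers 0 and 1, and that [PET_BOUNDARY] appears on both sides of the PET in edited_text (the exact marker placement is not spelled out here). The helper names `count_labels` and `extract_pet_span` and the sample rows are hypothetical, made up for illustration.

```python
import csv
import io
from collections import Counter

MARKER = "[PET_BOUNDARY]"


def count_labels(rows):
    """Count rows per value of the `label` column (1 = euphemistic, 0 = not).

    Assumption: labels are stored as the integers 0 and 1.
    """
    return Counter(int(row["label"]) for row in rows)


def extract_pet_span(edited_text):
    """Return the text between the first two markers, or None if not found.

    Assumption: the PET is wrapped as MARKER + term + MARKER.
    """
    parts = edited_text.split(MARKER)
    if len(parts) < 3:
        return None
    return parts[1].strip()


# Made-up sample rows standing in for the real file (illustration only).
SAMPLE_CSV = """num,PET,label,edited_text
1,term_a,1,some text [PET_BOUNDARY]term_a[PET_BOUNDARY] more text
2,term_a,0,other text [PET_BOUNDARY]term_a[PET_BOUNDARY] here
3,term_b,1,[PET_BOUNDARY]term_b[PET_BOUNDARY] at the start
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
label_counts = count_labels(rows)
spans = [extract_pet_span(row["edited_text"]) for row in rows]
```

To run this on the real data, replace `io.StringIO(SAMPLE_CSV)` with `open("turkish_pets_full_dataset.csv", newline="", encoding="utf-8")`, adjusting the path to wherever the file sits in your checkout.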
turkish_pets_balanced_dataset.csv
- Size: 908 examples (521 euphemistic, 387 non-euphemistic).
- This is the balanced dataset used for the classification experiments mentioned in the 'Turkish Delights: A Dataset on Turkish Euphemisms' paper.
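Because the two classes are not exactly equal in size, it can help to know the share of each class and the accuracy that a trivial always-predict-the-majority classifier would reach. The short calculation below uses only the counts stated above; the majority baseline is added here as a reference point and is not a figure taken from the paper.

```python
# Class counts as stated for turkish_pets_balanced_dataset.csv.
euphemistic = 521
non_euphemistic = 387
total = euphemistic + non_euphemistic  # 908

# Share of each class in the dataset.
euphemistic_share = euphemistic / total
non_euphemistic_share = non_euphemistic / total

# Accuracy of always predicting the larger class (a reference point only).
majority_baseline = max(euphemistic, non_euphemistic) / total

print(f"euphemistic: {euphemistic_share:.1%}")
print(f"non-euphemistic: {non_euphemistic_share:.1%}")
print(f"majority baseline accuracy: {majority_baseline:.1%}")
```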
Turkish_PETs_List.pdf
- A supplementary resource listing additional euphemistic terms that were not represented in the datasets due to a lack of examples in the corpus.
This section explains how to set up and run experiments using the Turkish PETs dataset. Follow these steps to reproduce the results or adapt the setup for your own experiments.
Clone the experiment branch of this repository to your local machine:
git clone --branch experiment https://github.com/hasancanbiyik/Turkish_PETs.git
cd Turkish_PETs
Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
Install the dependencies:
pip install -r experiment/EXPERIMENT_RUNNER/requirements.txt
Make the run script executable:
chmod +x local/run.sh
Run the run.sh script:
# Execute the shell script for a predefined experiment
bash local/run.sh
The splits/ folder contains 20 folds of cross-validation splits for fine-tuning and evaluating models.
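How the files in splits/ were produced, and how they are laid out, is not described here, so the following is only a sketch of one common way to build such folds: stratified k-fold assignment, where each fold keeps roughly the same label proportions. `stratified_folds` and `train_test_split_for_fold` are hypothetical helper names, not part of the repository, and the label list is synthesized from the balanced dataset counts purely for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=20, seed=0):
    """Assign each example index to one of n_folds folds, keeping the
    label proportions roughly equal across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous label
        # stopped so that fold sizes stay even.
        for position, index in enumerate(indices):
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds


def train_test_split_for_fold(folds, fold_id):
    """Use one fold as the test set and all other folds as training data."""
    test = sorted(folds[fold_id])
    train = sorted(i for k, fold in enumerate(folds) if k != fold_id for i in fold)
    return train, test


# Synthetic labels matching the balanced dataset counts (521 / 387).
labels = [1] * 521 + [0] * 387
folds = stratified_folds(labels, n_folds=20, seed=42)
train_idx, test_idx = train_test_split_for_fold(folds, fold_id=0)
```

To reproduce the reported results, use the provided files in splits/ rather than regenerating folds, since a different shuffle would give different partitions.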
Fine-tuned models for XLM-R, mBERT, BERTurk, and ELECTRA will be hosted on Hugging Face. These models will allow researchers to benchmark their approaches without requiring additional fine-tuning.
Stay tuned for updates!
If you use this dataset or repository in your research, please cite our paper:
- Hasan Biyik, Patrick Lee, and Anna Feldman. 2024. Turkish Delights: a Dataset on Turkish Euphemisms. In Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024), pages 71–80, Bangkok, Thailand and Online. Association for Computational Linguistics.
For questions or suggestions, feel free to reach out:
Hasan Can Biyik
Research Assistant, Montclair State University