News | Competition | Results | Dataset | Important Dates | Data Format | Evaluation Metrics | Baselines | Organizers | Contacts
Large language models (LLMs) are becoming mainstream and easily accessible, ushering in an explosion of machine-generated content across various channels, such as news, social media, question-answering forums, and educational and even academic contexts. Recent LLMs, such as GPT-4o, Claude 3.5, and Gemini 1.5 Pro, generate remarkably fluent responses to a wide variety of user queries. The articulate nature of such generated texts makes LLMs attractive for replacing human labor in many scenarios. However, this has also raised concerns about their potential misuse, such as spreading misinformation and causing disruptions in the education system. Since humans perform only slightly better than chance when classifying machine-generated vs. human-written text, there is a need to develop automatic systems to identify machine-generated text with the goal of mitigating its potential misuse.
In the COLING Workshop on MGT Detection Task 1, we adopt a straightforward binary problem formulation: determining whether a given text was generated by a machine or authored by a human. This task continues and improves upon SemEval-2024 Task 8 (Subtask A). We aim to refresh the training and testing data with generations from novel LLMs and to include new languages.
There are two subtasks:
- Subtask A: English-only MGT detection.
- Subtask B: Multilingual MGT detection.
Please check the gold labels of the dev-test and test sets on Google Drive
Updated test sets! Some participants provided valuable feedback on the test sets. To improve their quality, we removed some rows while keeping the original ids. Please check the updated test sets on Google Drive
You may see score=-1 when you submit your results to Codabench. This is normal; your accuracy and rank will appear within a few minutes during the evaluation phase. Your last submission counts.
We are excited to release the test sets for both the English and Multilingual tracks. Download the test set from Google Drive
Submit predictions in the original text order, with human label = 0 and machine label = 1. We look forward to your excellent detection results!
We have extended the test set release to October 29, 2024
We have released our training and dev sets.
The competition is held on Codabench
Official results for the test phase
Download the training and dev sets from Google Drive or from Hugging Face (English and Multilingual).
All dates are AoE.
- 27th August, 2024: Training/dev set release
- 20th October, 2024 (extended to 29th October): Test set release and evaluation phase starts
- 25th October, 2024 (extended to 2nd November): Evaluation phase closes
- 28th October, 2024 (extended to 5th November): Leaderboard made public
- 15th November, 2024: System description paper submission
A prediction file must be a single JSONL file containing predictions for all texts. The entry for each text must include the fields "id" and "label".
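For illustration, the first few lines of a prediction file could look like this (the ids shown are hypothetical; label 0 = human, 1 = machine):

```
{"id": 0, "label": 1}
{"id": 1, "label": 0}
{"id": 2, "label": 1}
```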
The format checkers verify that your prediction file complies with the expected format. They are located in the format_checker module.
python3 format_checker.py --prediction_file_path=<path_to_your_results_files>
Note that the format checker cannot verify whether your prediction file contains predictions for all test instances, because it does not have access to the test file.
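As an illustration only, a minimal check along these lines would validate each line of the file (a sketch of the idea, not the actual format_checker implementation):

```python
import json

def check_format(prediction_file_path):
    """Return True if every line is a JSON object with an "id" and a binary "label"."""
    with open(prediction_file_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                print(f"Line {line_no}: not valid JSON")
                return False
            if "id" not in obj or obj.get("label") not in (0, 1):
                print(f"Line {line_no}: missing 'id' or 'label' not in {{0, 1}}")
                return False
    return True
```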
The scorers for the subtasks are located in the scorer modules.
The scorer will report the official evaluation metric and other metrics for a given prediction file.
The official evaluation metric is the macro F1-score. The scorer also reports accuracy and micro-F1.
The following command runs the scorer:
python3 scorer.py --gold_file_path=<path_to_gold_labels> --prediction_file_path=<path_to_your_results_file>
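For reference, here is a minimal sketch of the scoring logic, assuming both the gold and prediction files are JSONL with "id" and "label" fields (the released scorer.py may differ in details):

```python
import argparse
import json

from sklearn.metrics import accuracy_score, f1_score

def read_labels(path):
    """Read a JSONL file and return a dict mapping id -> label."""
    with open(path, encoding="utf-8") as f:
        return {obj["id"]: obj["label"] for obj in map(json.loads, f)}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--gold_file_path", required=True)
    parser.add_argument("--prediction_file_path", required=True)
    args = parser.parse_args()

    gold = read_labels(args.gold_file_path)
    pred = read_labels(args.prediction_file_path)

    # Align predictions to gold ids; a missing prediction raises a KeyError.
    y_true = [gold[i] for i in gold]
    y_pred = [pred[i] for i in gold]

    print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # official metric
    print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
    print("accuracy:", accuracy_score(y_true, y_pred))
```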
Running the Transformer baseline:
python3 baseline.py --train_file_path <path_to_train_file> --dev_file_path <path_to_development_file> --test_file_path <path_to_test_file> --prediction_file_path <path_to_save_predictions> --model <path_to_model>
The result for the English track using RoBERTa is 81.63.
The result for the multilingual track using XLM-R is 65.46.
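For orientation, below is a minimal sketch of what such a fine-tuning baseline could look like with Hugging Face Transformers. The file names, field names ("text", "label"), and hyperparameters are illustrative assumptions; the released baseline.py may differ:

```python
import json

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "roberta-base"  # use "xlm-roberta-base" for the multilingual track

def load_jsonl(path):
    # Load a JSONL file with "text" and "label" fields into a datasets.Dataset.
    with open(path, encoding="utf-8") as f:
        return Dataset.from_list([json.loads(line) for line in f])

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = load_jsonl("train.jsonl").map(tokenize, batched=True)  # hypothetical path
dev = load_jsonl("dev.jsonl").map(tokenize, batched=True)      # hypothetical path

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

training_args = TrainingArguments(
    output_dir="baseline_out",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# Passing the tokenizer enables dynamic padding via the default data collator.
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train, eval_dataset=dev, tokenizer=tokenizer)
trainer.train()

# Binary predictions: argmax over the two logits (0 = human, 1 = machine).
preds = trainer.predict(dev).predictions.argmax(axis=-1)
```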
- Preslav Nakov, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Iryna Gurevych, Mohamed bin Zayed University of Artificial Intelligence, UAE; Technical University of Darmstadt, Germany
- Nizar Habash, New York University Abu Dhabi, UAE
- Alham Fikri Aji, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Yuxia Wang, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Artem Shelmanov, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Ekaterina Artemova, Toloka AI, Netherlands
- Osama Mohammed Afzal, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Jonibek Mansurov, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Zhuohan Xie, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Jinyan Su, Cornell University, USA
- Akim Tsvigun, Nebius AI, Netherlands
- Giovanni Puccetti, Institute of Information Science and Technology “A. Faedo”, Italy
- Rui Xing, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Tarek Mahmoud, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Jiahui Geng, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Masahiro Kaneko, Mohamed bin Zayed University of Artificial Intelligence, UAE
- Ryuto Koike, Tokyo Institute of Technology, Japan
- Fahad Shamshad, Mohamed bin Zayed University of Artificial Intelligence, UAE
Website: https://genai-content-detection.gitlab.io
Email: genai-content-detection@googlegroups.com