This is the official repository for DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models, presented at the EMNLP 2024 main conference in Miami. DecorateLM offers a systematic approach to improving pretraining corpora for Large Language Models (LLMs) through innovative data engineering techniques.
DecorateLM is an open-source toolkit designed to:
- Rate: Assess texts against predefined quality criteria, filtering lower-quality data.
- Tag: Organize data with hierarchical labels for better categorization and structured training.
- Edit: Refine and standardize text formats to ensure consistency and clarity.
By optimizing the quality of pretraining data, DecorateLM enhances model robustness and adaptability across diverse tasks.
The repository provides open-source implementations for each annotation phase—rating, tagging, and editing. The DecorateLM model will soon be available on Hugging Face, empowering researchers and developers to seamlessly integrate high-quality data refinement into their LLM training pipelines.
To run the complete scoring pipeline, execute the main Bash script:

```bash
bash run_scoring_pipeline.sh
```
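The pipeline reads its items from a JSONL file (the `INPUT_PATH` option described below), one JSON object per line. The snippet below writes a minimal, hypothetical input; the `text` field is an assumption, and `KEYS` should be set to whatever field(s) your data actually uses.

```python
# Hypothetical input: one JSON object per line, content under a "text" field.
# The field name is an assumption; point KEYS at the field(s) your data uses.
import json

records = [
    {"text": "The Eiffel Tower was completed in 1889."},
    {"text": "Water boils at 100 degrees Celsius at sea level."},
]
with open("input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```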
Each step in the pipeline performs a specific function, as outlined below.
- Random Pair Sampling (`random_pairs_sampler.py`): Generates random pairs for comparison from the input data file.
- GPT Annotation (`rating_annotater.py`): Uses GPT to compare each sampled pair and select a winner.
- Bradley-Terry Scoring (`score.py`): Calculates a score for each item based on the Bradley-Terry model (see the sketch below).
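For intuition, here is a minimal sketch of how Bradley-Terry scores can be fitted from pairwise winners by gradient ascent on the log-likelihood. The function name, data layout, and hyperparameter defaults are illustrative assumptions, not the actual contents of `score.py`.

```python
# Sketch: fit one strength parameter per item so that
# P(i beats j) = sigmoid(s_i - s_j) matches the observed winners.
import numpy as np

def bradley_terry_scores(pairs, n_items, lr=0.1, iterations=1000):
    """pairs: list of (winner_idx, loser_idx) comparisons."""
    s = np.zeros(n_items)
    winners = np.array([w for w, _ in pairs])
    losers = np.array([l for _, l in pairs])
    for _ in range(iterations):
        # Probability the recorded winner wins under the current scores.
        p_win = 1.0 / (1.0 + np.exp(-(s[winners] - s[losers])))
        grad = np.zeros(n_items)
        # d log-likelihood / d s: +(1 - p) for each win, -(1 - p) for each loss.
        np.add.at(grad, winners, 1.0 - p_win)
        np.add.at(grad, losers, -(1.0 - p_win))
        s += lr * grad
    # Only score differences are identifiable, so center the scores at zero.
    return s - s.mean()

# Example: item 0 beats item 1 twice, item 1 beats item 2 once.
print(bradley_terry_scores([(0, 1), (0, 1), (1, 2)], n_items=3))
```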
The `run_scoring_pipeline.sh` script contains the following configurable options:

- Input and Output Paths
  - `INPUT_PATH`: Path to the input JSONL file containing items for comparison.
  - `PAIRS_IDX_PATH`: Path to save the sampled pairs.
  - `GPT_ANNOTATION_PATH`: Path to save the GPT-generated annotations.
  - `FINAL_OUTPUT_PATH`: Path to save the final scores.
- Random Pair Sampler Parameters (see the sketch after this list)
  - `N`: Number of items to sample from the input data.
  - `N_PAIRS`: Number of pairs to generate for comparison.
  - `COMPARE`: Minimum number of comparisons per item.
  - `RANDOM_SEED`: Seed for reproducibility.
- GPT Annotation Parameters
  - `KEYS`: Key(s) in the JSON whose content is annotated, separated by commas.
  - `BATCH_SIZE`: Number of pairs in each annotation batch.
  - `PROMPTS_PATH`: Path to the directory containing annotation prompts.
  - `TASK`: Rating criterion used in annotation (e.g., `educational_value`, `expertise`).
- Bradley-Terry Model Parameters
  - `LR`: Learning rate for the scoring model.
  - `ITERATIONS`: Number of iterations for optimization.
  - `SCORE_TYPE`: Method for score calculation (`bradley_terry` or `uniform`).
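To make the sampler parameters concrete, the sketch below shows one way `N` items, `N_PAIRS` pairs, a minimum of `COMPARE` comparisons per item, and `RANDOM_SEED` could fit together. It is a hypothetical illustration, not the logic of `random_pairs_sampler.py`.

```python
# Sketch: guarantee each item a minimum number of comparisons, then fill
# the remaining pair budget with uniformly random pairs.
import random

def sample_pairs(n_items, n_pairs, min_compare, seed=42):
    rng = random.Random(seed)
    # Clamp to the number of distinct unordered pairs so the loops terminate.
    n_pairs = min(n_pairs, n_items * (n_items - 1) // 2)
    pairs = set()
    counts = [0] * n_items

    def add(i, j):
        pair = (min(i, j), max(i, j))
        if i != j and pair not in pairs:
            pairs.add(pair)
            counts[i] += 1
            counts[j] += 1

    # First pass: keep pairing the least-compared item until every item
    # has at least `min_compare` comparisons (or the pair budget runs out).
    while min(counts) < min_compare and len(pairs) < n_pairs:
        i = min(range(n_items), key=counts.__getitem__)
        add(i, rng.choice([j for j in range(n_items) if j != i]))

    # Second pass: fill the remaining budget with uniformly random pairs.
    while len(pairs) < n_pairs:
        add(rng.randrange(n_items), rng.randrange(n_items))

    return sorted(pairs)

# Example: 100 items, 300 pairs, each item compared at least twice.
print(len(sample_pairs(n_items=100, n_pairs=300, min_compare=2)))
```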
The following rating criteria are supported:
- fact&trivia: Rates items based on factual accuracy and trivia quality.
- expertise: Rates items on the level of expertise required or demonstrated.
- educational_value: Rates items based on their educational worth.
- scarcity: Rates items based on rarity or scarcity of information.
- reasoning_level: Rates items based on the complexity of reasoning.
- structural_format: Rates items based on their structural clarity and format.
- story-like: Rates items based on their narrative or story-like quality.
- subjectivity: Rates items based on subjectivity or personal opinion.
To use a rating criterion, set the `TASK` variable in the Bash script to one of the criteria listed above.
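As a quick sanity check, the chosen criterion can be validated before launching the pipeline. This assumes the identifiers above are used verbatim as `TASK` values; the constant name below is illustrative.

```python
# Illustrative check that TASK names one of the supported rating criteria.
SUPPORTED_TASKS = {
    "fact&trivia", "expertise", "educational_value", "scarcity",
    "reasoning_level", "structural_format", "story-like", "subjectivity",
}

task = "educational_value"
assert task in SUPPORTED_TASKS, f"Unsupported rating criterion: {task}"
```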
[Details coming soon.]
If you find DecorateLM helpful in your research, please cite our paper:
```bibtex
@inproceedings{zhao2024decoratelm,
  title={DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models},
  author={Zhao, Ranchi and Thai, Zhen and Zhang, Yifan and Hu, Shengding and Zhou, Jie and Ba, Yunqi and Cai, Jie and Liu, Zhiyuan and Sun, Maosong},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={1401--1418},
  year={2024}
}
```