This is the official repository for DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models, presented at the EMNLP 2024 main conference in Miami. DecorateLM offers a systematic approach to improving pretraining corpora for Large Language Models (LLMs) through innovative data engineering techniques.
DecorateLM is an open-source toolkit designed to:
- Rate: Assess texts against predefined quality criteria, filtering lower-quality data.
- Tag: Organize data with hierarchical labels for better categorization and structured training.
- Edit: Refine and standardize text formats to ensure consistency and clarity.
By optimizing the quality of pretraining data, DecorateLM enhances model robustness and adaptability across diverse tasks.
The repository provides open-source implementations for each annotation phase—rating, tagging, and editing. The DecorateLM model will soon be available on Hugging Face, empowering researchers and developers to seamlessly integrate high-quality data refinement into their LLM training pipelines.
To run the complete scoring pipeline, execute the main Bash script:

```bash
bash run_scoring_pipeline.sh
```
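The pipeline reads its items from a JSONL file (the `INPUT_PATH` option described below), one JSON object per line. The snippet below writes a minimal, hypothetical input; the `text` field is an assumption, and `KEYS` should be set to whatever field(s) your data actually uses.

```python
# Hypothetical input: one JSON object per line, content under a "text" field.
# The field name is an assumption; point KEYS at the field(s) your data uses.
import json

records = [
    {"text": "The Eiffel Tower was completed in 1889."},
    {"text": "Water boils at 100 degrees Celsius at sea level."},
]
with open("input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```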
Each step in the pipeline performs a specific function, as outlined below.
- Random Pair Sampling (`random_pairs_sampler.py`): Generates random pairs for comparison from the input data file.
- GPT Annotation (`rating_annotater.py`): Uses GPT to compare each sampled pair and select a winner.
- Bradley-Terry Scoring (`score.py`): Calculates a score for each item based on the Bradley-Terry model (see the sketch below).
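For intuition, here is a minimal sketch of how Bradley-Terry scores can be fitted from pairwise winners by gradient ascent on the log-likelihood. The function name, data layout, and hyperparameter defaults are illustrative assumptions, not the actual contents of `score.py`.

```python
# Sketch: fit one strength parameter per item so that
# P(i beats j) = sigmoid(s_i - s_j) matches the observed winners.
import numpy as np

def bradley_terry_scores(pairs, n_items, lr=0.1, iterations=1000):
    """pairs: list of (winner_idx, loser_idx) comparisons."""
    s = np.zeros(n_items)
    winners = np.array([w for w, _ in pairs])
    losers = np.array([l for _, l in pairs])
    for _ in range(iterations):
        # Probability the recorded winner wins under the current scores.
        p_win = 1.0 / (1.0 + np.exp(-(s[winners] - s[losers])))
        grad = np.zeros(n_items)
        # d log-likelihood / d s: +(1 - p) for each win, -(1 - p) for each loss.
        np.add.at(grad, winners, 1.0 - p_win)
        np.add.at(grad, losers, -(1.0 - p_win))
        s += lr * grad
    # Only score differences are identifiable, so center the scores at zero.
    return s - s.mean()

# Example: item 0 beats item 1 twice, item 1 beats item 2 once.
print(bradley_terry_scores([(0, 1), (0, 1), (1, 2)], n_items=3))
```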
The `run_scoring_pipeline.sh` script contains the following configurable options:

- Input and Output Paths
  - `INPUT_PATH`: Path to the input JSONL file containing items for comparison.
  - `PAIRS_IDX_PATH`: Path to save the sampled pairs.
  - `GPT_ANNOTATION_PATH`: Path to save the GPT-generated annotations.
  - `FINAL_OUTPUT_PATH`: Path to save the final scores.
- Random Pair Sampler Parameters (see the sketch after this list)
  - `N`: Number of items to sample from the input data.
  - `N_PAIRS`: Number of pairs to generate for comparison.
  - `COMPARE`: Minimum number of comparisons per item.
  - `RANDOM_SEED`: Seed for reproducibility.
- GPT Annotation Parameters
  - `KEYS`: Key(s) in the JSON whose content is annotated, separated by commas.
  - `BATCH_SIZE`: Number of pairs in each annotation batch.
  - `PROMPTS_PATH`: Path to the directory containing annotation prompts.
  - `TASK`: Rating criterion used in annotation (e.g., `educational_value`, `expertise`).
- Bradley-Terry Model Parameters
  - `LR`: Learning rate for the scoring model.
  - `ITERATIONS`: Number of iterations for optimization.
  - `SCORE_TYPE`: Method for score calculation (`bradley_terry` or `uniform`).
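To make the sampler parameters concrete, the sketch below shows one way `N` items, `N_PAIRS` pairs, a minimum of `COMPARE` comparisons per item, and `RANDOM_SEED` could fit together. It is a hypothetical illustration, not the logic of `random_pairs_sampler.py`.

```python
# Sketch: guarantee each item a minimum number of comparisons, then fill
# the remaining pair budget with uniformly random pairs.
import random

def sample_pairs(n_items, n_pairs, min_compare, seed=42):
    rng = random.Random(seed)
    # Clamp to the number of distinct unordered pairs so the loops terminate.
    n_pairs = min(n_pairs, n_items * (n_items - 1) // 2)
    pairs = set()
    counts = [0] * n_items

    def add(i, j):
        pair = (min(i, j), max(i, j))
        if i != j and pair not in pairs:
            pairs.add(pair)
            counts[i] += 1
            counts[j] += 1

    # First pass: keep pairing the least-compared item until every item
    # has at least `min_compare` comparisons (or the pair budget runs out).
    while min(counts) < min_compare and len(pairs) < n_pairs:
        i = min(range(n_items), key=counts.__getitem__)
        add(i, rng.choice([j for j in range(n_items) if j != i]))

    # Second pass: fill the remaining budget with uniformly random pairs.
    while len(pairs) < n_pairs:
        add(rng.randrange(n_items), rng.randrange(n_items))

    return sorted(pairs)

# Example: 100 items, 300 pairs, each item compared at least twice.
print(len(sample_pairs(n_items=100, n_pairs=300, min_compare=2)))
```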
The following rating criteria are supported:
- fact&trivia: Rates items based on factual accuracy and trivia quality.
- expertise: Rates items on the level of expertise required or demonstrated.
- educational_value: Rates items based on their educational worth.
- scarcity: Rates items based on rarity or scarcity of information.
- reasoning_level: Rates items based on the complexity of reasoning.
- structural_format: Rates items based on their structural clarity and format.
- story-like: Rates items based on their narrative or story-like quality.
- subjectivity: Rates items based on subjectivity or personal opinion.
To use a rating criterion, set the `TASK` variable in the Bash script to one of the criteria listed above.
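As a quick sanity check, the chosen criterion can be validated before launching the pipeline. This assumes the identifiers above are used verbatim as `TASK` values; the constant name below is illustrative.

```python
# Illustrative check that TASK names one of the supported rating criteria.
SUPPORTED_TASKS = {
    "fact&trivia", "expertise", "educational_value", "scarcity",
    "reasoning_level", "structural_format", "story-like", "subjectivity",
}

task = "educational_value"
assert task in SUPPORTED_TASKS, f"Unsupported rating criterion: {task}"
```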
[Details coming soon.]
If you find DecorateLM helpful in your research, please cite our paper:
```bibtex
@inproceedings{zhao2024decoratelm,
  title={DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models},
  author={Zhao, Ranchi and Thai, Zhen and Zhang, Yifan and Hu, Shengding and Zhou, Jie and Ba, Yunqi and Cai, Jie and Liu, Zhiyuan and Sun, Maosong},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={1401--1418},
  year={2024}
}
```