DecorateLM

This is the official repository for DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models, presented at the EMNLP 2024 main conference (Miami). DecorateLM offers a systematic approach to improving pretraining corpora for Large Language Models (LLMs) through data engineering techniques.

DecorateLM is an open-source toolkit designed to:

  • Rate: Assess texts against predefined quality criteria, filtering lower-quality data.
  • Tag: Organize data with hierarchical labels for better categorization and structured training.
  • Edit: Refine and standardize text formats to ensure consistency and clarity.

By optimizing the quality of pretraining data, DecorateLM enhances model robustness and adaptability across diverse tasks.

The repository provides open-source implementations for each annotation phase—rating, tagging, and editing. The DecorateLM model will soon be available on Hugging Face, empowering researchers and developers to seamlessly integrate high-quality data refinement into their LLM training pipelines.

Annotation

Rating

To run the complete scoring pipeline, execute the main Bash script:

bash run_scoring_pipeline.sh

Each step in the pipeline performs a specific function, as outlined below.
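The input file is JSONL, with one JSON record per line. The exact schema is yours to define; as a purely illustrative example (the id and text field names below are hypothetical, not required by the pipeline), a record might look like:

{"id": 0, "text": "Photosynthesis converts light energy into chemical energy in plants."}
{"id": 1, "text": "Top ten pizza toppings, ranked."}

Set the KEYS option (described below) to whichever field(s) hold the content to be rated.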

Scripts

  • Random Pair Sampling (random_pairs_sampler.py): Generates random pairs of items for comparison from the input data file.
  • GPT Annotation (rating_annotater.py): Uses GPT to perform a pairwise comparison for each sampled pair, recording the winner of each pair.
  • Bradley-Terry Scoring (score.py): Computes a score for each item from the pairwise outcomes using the Bradley-Terry model (a minimal sketch of this step follows the list).
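
To make the scoring step concrete, here is a minimal, self-contained sketch of Bradley-Terry scoring via gradient ascent on the pairwise log-likelihood. This is not the repository's score.py: the function name and the (winner, loser) input format are assumptions, and the lr and iterations arguments merely stand in for the LR and ITERATIONS options described below.

import numpy as np

def bradley_terry_scores(pairs, n_items, lr=0.01, iterations=1000):
    """Estimate Bradley-Terry log-strengths from (winner, loser) index pairs."""
    s = np.zeros(n_items)  # one log-strength per item
    winners = np.array([w for w, _ in pairs])
    losers = np.array([l for _, l in pairs])
    for _ in range(iterations):
        # P(winner beats loser) under the current scores:
        # sigma(s_w - s_l) = 1 / (1 + exp(s_l - s_w))
        p_win = 1.0 / (1.0 + np.exp(s[losers] - s[winners]))
        # Gradient of the log-likelihood: +(1 - p_win) for the winner,
        # -(1 - p_win) for the loser; np.add.at handles repeated indices.
        grad = np.zeros(n_items)
        np.add.at(grad, winners, 1.0 - p_win)
        np.add.at(grad, losers, -(1.0 - p_win))
        s += lr * grad
    return s - s.mean()  # center scores for identifiability

# Example: item 0 beats 1 twice and 2 once; item 1 beats 2 once.
print(bradley_terry_scores([(0, 1), (0, 1), (0, 2), (1, 2)], n_items=3))

Scores are centered to zero mean because the Bradley-Terry model is only identified up to an additive constant; any such shift preserves the ranking.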

Configuration Options

The run_scoring_pipeline.sh script contains the following configurable options:

  • Input and Output Paths

    • INPUT_PATH: Path to the input JSONL file containing items for comparison.
    • PAIRS_IDX_PATH: Path to save the sampled pairs.
    • GPT_ANNOTATION_PATH: Path to save the GPT-generated annotations.
    • FINAL_OUTPUT_PATH: Path to save the final scores.
  • Random Pair Sampler Parameters (see the sampler sketch after this list)

    • N: Number of items to sample from the input data.
    • N_PAIRS: Number of pairs to generate for comparison.
    • COMPARE: Minimum number of comparisons per item.
    • RANDOM_SEED: Seed for reproducibility.
  • GPT Annotation Parameters (see the annotation sketch at the end of this section)

    • KEYS: Comma-separated list of key(s) in each JSON record whose content should be annotated.

    • BATCH_SIZE: Number of pairs in each batch for annotation.

    • PROMPTS_PATH: Path to the directory containing annotation prompts.

    • TASK: Rating criterion used in annotation (e.g., educational_value, expertise, etc.).

  • Bradley-Terry Model Parameters

    • LR: Learning rate for the scoring model.
    • ITERATIONS: Number of iterations for optimization.
    • SCORE_TYPE: Method for score calculation (bradley_terry or uniform).
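
As a worked illustration of the sampler parameters above, the sketch below shows one plausible way to sample pairs while guaranteeing every item a minimum number of comparisons. It is an assumption-laden stand-in, not the repository's random_pairs_sampler.py: the round-robin first pass plus uniform second pass is just one reasonable strategy.

import random

def sample_pairs(n_items, n_pairs, min_compare, seed=42):
    """Sample index pairs so every item appears in >= min_compare pairs,
    then fill up to n_pairs with uniform random pairs."""
    rng = random.Random(seed)
    pairs = []
    counts = [0] * n_items
    # First pass: give every item at least min_compare comparisons.
    for i in range(n_items):
        while counts[i] < min_compare:
            j = rng.randrange(n_items)
            if j == i:
                continue
            pairs.append((i, j))
            counts[i] += 1
            counts[j] += 1
    # Second pass: top up to n_pairs with uniform random pairs.
    while len(pairs) < n_pairs:
        pairs.append(tuple(rng.sample(range(n_items), 2)))
    return pairs

# e.g., with N=100 items, N_PAIRS=500, COMPARE=3, RANDOM_SEED=42:
pairs = sample_pairs(n_items=100, n_pairs=500, min_compare=3, seed=42)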

Available Rating Criteria

The following rating criteria are supported:

  • fact&trivia: Rates items based on factual accuracy and trivia quality.
  • expertise: Rates items on the level of expertise required or demonstrated.
  • educational_value: Rates items based on their educational worth.
  • scarcity: Rates items based on rarity or scarcity of information.
  • reasoning_level: Rates items based on the complexity of reasoning.
  • structural_format: Rates items based on their structural clarity and format.
  • story-like: Rates items based on their narrative or story-like quality.
  • subjectivity: Rates items based on subjectivity or personal opinion.

To use a rating criterion, set the TASK variable in run_scoring_pipeline.sh to one of the criteria listed above, e.g. TASK="educational_value".
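
Putting the annotation parameters together, a single pairwise comparison call might look like the following sketch. The model name, prompt wording, and answer-parsing convention are assumptions made for illustration; in the actual pipeline the criterion prompts are loaded from PROMPTS_PATH and pairs are processed in batches of BATCH_SIZE.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def compare_pair(text_a, text_b, criterion_prompt, model="gpt-4"):
    """Ask the model which of two texts rates higher on the given
    criterion; returns 'A' or 'B'. Prompt wording is illustrative only."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": criterion_prompt},
            {"role": "user", "content": (
                f"Text A:\n{text_a}\n\nText B:\n{text_b}\n\n"
                "Which text rates higher on the criterion above? "
                "Answer with exactly one letter: A or B."
            )},
        ],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "A" if answer.startswith("A") else "B"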

Tagging

[Details coming soon.]

Editing

[Details coming soon.]

Decorated Corpus

[Details coming soon.]

DecorateLM Model

[Details coming soon.]

Citation

If you find DecorateLM helpful in your research, please cite our paper:

@inproceedings{zhao2024decoratelm,
  title={DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models},
  author={Zhao, Ranchi and Thai, Zhen and Zhang, Yifan and Hu, Shengding and Zhou, Jie and Ba, Yunqi and Cai, Jie and Liu, Zhiyuan and Sun, Maosong},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={1401--1418},
  year={2024}
}