OnlineDPO: ValueError "You must specify exactly one of input_ids or inputs_embeds" during evaluation
System Info

Copy-paste the following information when reporting an issue:

Information

Tasks

An officially supported task in the examples folder

Reproduction
from pathlib import Path

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
)
from trl import OnlineDPOConfig, OnlineDPOTrainer

# Load model to be trained
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Load reward model
reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")

# Load dataset
ds = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
ds = ds.train_test_split(test_size=0.1, seed=42)

training_args = OnlineDPOConfig(
    output_dir=str(Path(__file__).parent / "online_dpo_checkpoints"),
    num_train_epochs=10,
    overwrite_output_dir=True,
    eval_strategy="steps",
    save_strategy="steps",
    report_to="none",
    save_steps=5,
    eval_steps=5,
    # weight_decay=0.01,  # Add weight decay
    # warmup_steps=2,  # Add warmup
    gradient_checkpointing=True,  # Great for memory saving
    gradient_checkpointing_kwargs={"use_reentrant": False},  # To remove the warning
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=2,
    save_total_limit=1,
    bf16=True,
    logging_steps=1,
    dataloader_num_workers=8,  # Use multiple processes for data loading
    dataloader_pin_memory=True,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    ddp_find_unused_parameters=False,  # Disable DDP's unused-parameter search
    remove_unused_columns=False,
    max_new_tokens=128,
    missing_eos_penalty=1.0,
    max_grad_norm=1.0,
    eval_delay=0.1,
    optim="adamw_torch_fused",
)

trainer = OnlineDPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_model=reward_model,
    reward_processing_class=reward_tokenizer,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    # data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
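The script above stops at trainer construction; the error appears once the evaluation loop is reached. A hedged sketch of the last step (calling evaluate() directly should exercise the same code path):

# Training triggers evaluation every `eval_steps` steps, which is where the
# error below is raised; evaluate() should reach the same path directly.
trainer.train()
# trainer.evaluate()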
outputs:

    raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
ValueError: You must specify exactly one of input_ids or inputs_embeds

This problem happens when training reaches the evaluation step.
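For reference, the message itself comes from the model's forward pass being called with neither input_ids nor inputs_embeds. A minimal sketch (reusing the model object from the reproduction script above) that produces the same error:

# Calling the causal LM forward with no inputs raises the same ValueError,
# which suggests the evaluation batch handed to the model has no input_ids.
try:
    model()  # neither input_ids nor inputs_embeds
except ValueError as err:
    print(err)  # You must specify exactly one of input_ids or inputs_embeds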
Expected behavior

Evaluation should run without raising this error.
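Until this is resolved, a possible temporary workaround (an assumption on my part, not a confirmed fix) is to skip the evaluation loop entirely and drop the settings that depend on it (load_best_model_at_end, metric_for_best_model, and the EarlyStoppingCallback):

# Hypothetical workaround, not a fix: never enter the evaluation loop, so
# training can proceed while this issue is open.
training_args = OnlineDPOConfig(
    output_dir="online_dpo_checkpoints",
    eval_strategy="no",            # skip evaluation entirely
    save_strategy="steps",
    save_steps=5,
    load_best_model_at_end=False,  # requires evaluation, so disable it
    bf16=True,
    max_new_tokens=128,
    missing_eos_penalty=1.0,
)
trainer = OnlineDPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_model=reward_model,
    reward_processing_class=reward_tokenizer,
    args=training_args,
    train_dataset=ds["train"],
    # eval_dataset and EarlyStoppingCallback removed: both need evaluation
)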