Running tokenizer on dataset 速度逐渐变慢 #5443

xuyue1112 · 2024-09-15T13:29:32Z

Reminder

I have read the README and searched the existing issues.

System Info

llamafactory version: 0.9.1.dev0
Platform: Linux-5.15.120.bsk.2-amd64-x86_64-with-glibc2.31
Python version: 3.11.2
PyTorch version: 2.4.0 (GPU)
Transformers version: 4.45.0.dev0
Datasets version: 2.21.0
Accelerate version: 0.34.2
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA A800-SXM4-40GB

Reproduction

dataset

dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 4096
max_samples: 5000000
overwrite_cache: true
preprocessing_num_workers: 16

Expected behavior

训练过程中，Running tokenizer on dataset 的速度逐渐从几百 samples/s 下降到个位数。请教下可能是哪里有问题？

Others

无

The text was updated successfully, but these errors were encountered:

2. update mistral format function call 3. fix knapsack, may cause hiyouga#5443 4. avoid supervised examples wrongly truncation hiyouga#5426

2. fix knapsack, may cause hiyouga#5443 3. avoid supervised examples wrongly truncation

AlongWY · 2024-09-18T14:38:21Z

经过我的实际测试，#5458 应该解决了这个问题

Wiselnn570 · 2024-10-26T11:21:36Z

@AlongWY 我也遇到了同样的问题，但你这个应该是针对packing情况的，如果没有packing应该怎么改呢

经过我的实际测试，#5458 应该解决了这个问题

AlongWY · 2024-10-28T09:42:17Z

没有 packing 也会下降到个位数吗？按理说应该不会吧

github-actions bot added the pending This problem is yet to be addressed label Sep 15, 2024

AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue Sep 17, 2024

1. support flatting_packing

558b983

2. update mistral format function call 3. fix knapsack, may cause hiyouga#5443 4. avoid supervised examples wrongly truncation hiyouga#5426

AlongWY mentioned this issue Sep 17, 2024

Flatting Packing / maybe fix #5443 and #5426 #5458

Open

2 tasks

AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue Sep 18, 2024

1. support flat_packing

7cab73b

2. fix knapsack, may cause hiyouga#5443 3. avoid supervised examples wrongly truncation

hiyouga added wontfix This will not be worked on and removed pending This problem is yet to be addressed labels Dec 5, 2024

hiyouga closed this as not planned Won't fix, can't repro, duplicate, stale Dec 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running tokenizer on dataset 速度逐渐变慢 #5443

Running tokenizer on dataset 速度逐渐变慢 #5443

xuyue1112 commented Sep 15, 2024

AlongWY commented Sep 18, 2024

Wiselnn570 commented Oct 26, 2024 •

edited

Loading

AlongWY commented Oct 28, 2024

Running tokenizer on dataset 速度逐渐变慢 #5443

Running tokenizer on dataset 速度逐渐变慢 #5443

Comments

xuyue1112 commented Sep 15, 2024

Reminder

System Info

Reproduction

dataset

Expected behavior

Others

AlongWY commented Sep 18, 2024

Wiselnn570 commented Oct 26, 2024 • edited Loading

AlongWY commented Oct 28, 2024

Wiselnn570 commented Oct 26, 2024 •

edited

Loading