
[Trainer] Support skip data intervals #8989

Merged
merged 12 commits into PaddlePaddle:develop on Sep 23, 2024

Conversation

greycooker (Contributor)

PR types

New Feature

PR changes

Support skip data intervals

Description

Adds a new training argument, skip_data_intervals, which specifies the data intervals to skip: for each interval, the training process skips the data from the interval's start global step to its end global step.
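The argument format is not spelled out here, so the following is only a rough sketch under an assumption: if skip_data_intervals is a list of [start_global_step, end_global_step] pairs, the should_skip_data helper referenced later in the diff could look roughly like this (the merged implementation in paddlenlp/trainer/trainer_utils.py may parse and validate the argument differently).

def should_skip_data(global_step, skip_data_intervals):
    """Return True if the next global step falls inside any skip interval.

    Assumes skip_data_intervals is a list of [start, end] global-step pairs,
    e.g. [[100, 200], [500, 520]]; this is an illustrative sketch only.
    """
    if not skip_data_intervals:
        return False
    next_step = global_step + 1
    return any(start <= next_step <= end for start, end in skip_data_intervals)

# Example: skip global steps 100-200 and 500-520.
print(should_skip_data(99, [[100, 200], [500, 520]]))   # True, the next step is 100
print(should_skip_data(250, [[100, 200], [500, 520]]))  # False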


paddle-bot bot commented Aug 22, 2024

Thanks for your contribution!


codecov bot commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 35.00000% with 39 lines in your changes missing coverage. Please review.

Project coverage is 53.24%. Comparing base (e340457) to head (1cdbf1d).
Report is 14 commits behind head on develop.

Files with missing lines             Patch %   Missing lines
paddlenlp/trainer/trainer.py         28.57%    30
paddlenlp/trainer/trainer_utils.py   25.00%    9
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8989      +/-   ##
===========================================
- Coverage    53.26%   53.24%   -0.02%     
===========================================
  Files          652      652              
  Lines       105587   105639      +52     
===========================================
+ Hits         56237    56252      +15     
- Misses       49350    49387      +37     

☔ View full report in Codecov by Sentry.


if args.recompute:
if not args.debug_data:
Collaborator

Hmm, so with debug_data the model and everything else doesn't run at all, right?

Does this need to be exposed externally, or should it be deleted once development is finished?

greycooker (Contributor Author) commented Aug 27, 2024

Yes, debug_data only prints the data without loading the model, and it doesn't train either. The intent here is to add it as a general-purpose feature.

Collaborator

If it's just a debug mode for our own internal use, I don't think there's much value in adding it.

# Skip data
if should_skip_data(self.state.global_step, self.args.skip_data_intervals):
    logger.warning(f"Skip data at global step {self.state.global_step+1}, sub step {step_control}")
    logger.warning(f"{self.tokenizer.batch_decode(inputs['input_ids'], skip_special_tokens=True)}")
Collaborator

Let's not add this one.

Suggested change (delete this line):
logger.warning(f"{self.tokenizer.batch_decode(inputs['input_ids'], skip_special_tokens=True)}")

greycooker (Contributor Author) commented Aug 27, 2024

This warning prints the skipped data. Removing it is fine too; the main point was to let users see what data is being skipped.

self.state.global_step += 1
self.state.epoch = epoch + (step + 1) / steps_in_epoch
self.control = self.callback_handler.on_step_end(args, self.state, self.control)
self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
Collaborator

This one probably isn't needed either, is it?

greycooker (Contributor Author) commented Aug 27, 2024

_maybe_log_save_evaluate is called here in order to go through:

1. Resetting tr_loss:

tr_loss.subtract_(tr_loss)

2. Updating _globalstep_last_logged:

self._globalstep_last_logged = self.state.global_step

3. The normal eval flow. Otherwise there will be a problem when eval computes consumed_samples at the end: https://github.com/PaddlePaddle/PaddleNLP/blob/48820cbc1fe986004f817c0517886735675732d2/paddlenlp/trainer/trainer.py#L2792C6-L2797C18
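As a rough, hypothetical illustration of why points 1 and 2 matter (the names mirror the snippets above; this is not the actual trainer code), the logging step typically averages the loss accumulated since the last logged step:

def interval_avg_loss(tr_loss_scalar, global_step, globalstep_last_logged):
    # Average the loss accumulated since the last logging step.
    steps_since_last_log = max(global_step - globalstep_last_logged, 1)
    return tr_loss_scalar / steps_since_last_log

# If skipped steps advance global_step but tr_loss is never reset and
# _globalstep_last_logged is never refreshed, the next real log would average
# over a window that includes steps with no forward/backward pass at all.
print(interval_avg_loss(tr_loss_scalar=4.0, global_step=110, globalstep_last_logged=100))  # 0.4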

Collaborator

My main concern is whether anything goes wrong when skipping data happens to coincide with eval, save, or any of the other callbacks.
Or alternatively, could we handle only the data here and not trigger anything else, while of course still keeping updates such as the step counter?

step_control += 1
if self.control.should_epoch_stop or self.control.should_training_stop:
    break
self.timers and self.timers("read-data").start()
Collaborator

I feel like you may not need a lot of this. Without any computation having run, I'm not sure whether triggering some of the callbacks is problem-free.

greycooker (Contributor Author)

This is here to make a few decisions, such as whether eval, save, or stopping training should happen. Running the callbacks directly without the forward/backward computation didn't raise any errors in my testing, but there may indeed be potential risks that the tests didn't cover.
https://github.com/PaddlePaddle/PaddleNLP/blob/48820cbc1fe986004f817c0517886735675732d2/paddlenlp/trainer/trainer_callback.py#L432C1-L460C23
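To summarize the behavior discussed in this thread, here is a small self-contained toy sketch (hypothetical, not the merged PaddleNLP code): batches that fall inside skip_data_intervals do no forward/backward work and trigger no optimizer step or extra callbacks, but the global step still advances so the logging/eval bookkeeping stays consistent.

# Toy simulation of the skip-data idea; names and interval format are assumptions.
skip_data_intervals = [[3, 5]]          # assumed format: [[start_step, end_step], ...]
global_step = 0
trained, skipped = [], []

for batch in range(10):                 # pretend there are 10 micro-batches
    next_step = global_step + 1
    if any(start <= next_step <= end for start, end in skip_data_intervals):
        skipped.append(next_step)       # bookkeeping only: no loss, no callbacks
    else:
        trained.append(next_step)       # normal path: forward/backward/optimizer
    global_step += 1                    # the step counter advances either way

print("trained:", trained)              # trained: [1, 2, 6, 7, 8, 9, 10]
print("skipped:", skipped)              # skipped: [3, 4, 5]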

ZHUI changed the title from "Support skip data intervals" to "[Trainer] Support skip data intervals" on Sep 13, 2024
ZHUI (Collaborator) left a comment

LGTM

sijunhe merged commit ad14dc4 into PaddlePaddle:develop on Sep 23, 2024
7 of 12 checks passed
greycooker added a commit to greycooker/PaddleNLP that referenced this pull request Sep 23, 2024
* support skip data intervals

* add debug_data arg

* fix loss compute

* remove callback while skip data

* remove debug data

* add callback_handler

* remove debug_data

* fix conflict
gongel pushed a commit that referenced this pull request Sep 23, 2024
* support skip data intervals

* add debug_data arg

* fix loss compute

* remove callback while skip data

* remove debug data

* add callback_handler

* remove debug_data

* fix conflict