fix(train): Fix step & destroy group #2168

Merged: 4 commits into main from xcsong-fix-log, Nov 29, 2023


@xingchensong (Member) commented Nov 27, 2023

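  1. In wenet 2.2.1, configs['step'] starts at -1 rather than 0: https://github.com/wenet-e2e/wenet/blob/v2.2.1/wenet/bin/train.py#L223
  2. In wenet 2.2.1, self.step is incremented only at the accum_grad boundary: https://github.com/wenet-e2e/wenet/blob/v2.2.1/wenet/utils/executor.py#L109 (These two differences should not affect training results, or only marginally. configs['step'] shifts the learning rate by just one step. self.step is used purely for logging and does not affect the training flow.)
  3. Move dist.barrier() before new_group to force synchronization and reduce how often timeouts occur.
  4. Add destroy_process_group to tear down each group_join once it has served its purpose, so the process does not exceed the max user processes limit set by ulimit -u. See "RuntimeError: Resource temporarily unavailable due to running out of threads (ulimit -u)", jax-ml/jax#2685.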
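
Items 1 and 2 above concern how the step counter is initialized and advanced. The following is a minimal sketch of that counting convention, not the wenet implementation: the function name `count_optimizer_steps` and the boundary test `(batch_idx + 1) % accum_grad == 0` are assumptions made for illustration, and the exact condition in wenet may differ.

```python
def count_optimizer_steps(num_batches: int, accum_grad: int, start_step: int = 0) -> int:
    """Return the step counter after num_batches batches.

    Hypothetical helper: the counter advances only when an accumulation
    window of accum_grad batches completes, not once per batch.
    """
    step = start_step
    for batch_idx in range(num_batches):
        # Forward and backward passes would run here for every batch.
        if (batch_idx + 1) % accum_grad == 0:
            # The optimizer and scheduler would step here, once per window.
            step += 1
    return step
```

With `accum_grad=4`, ten batches advance the counter by two. Starting from -1 instead of 0 only offsets the result by one, which is why the learning rate differs by a single step.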
Root Cause (first observed failure):
[0]:
  time      : 2023-11-28_05:28:31
  host      : hobot-job-10017872-task-1.hobot-job-10017872.project-2080ti-speech-smalltask.svc.cluster.local.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1554)
  error_file: /tmp/torchelastic_qjva9q5p/2023_fzr5792e/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "wenet/bin/train.py", line 129, in main
      group_join = dist.new_group(backend="gloo",
    File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3335, in new_group
      pg = _new_process_group_helper(
    File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
      pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
  RuntimeError: Resource temporarily unavailable
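
The traceback above fails inside dist.new_group, which is consistent with items 3 and 4: groups that are created repeatedly and never destroyed keep their resources. A minimal sketch of the pattern follows, assuming a context-manager wrapper. The name `temporary_group`, the `timeout_s` parameter, and passing the `dist` module as an argument are illustrative choices, not code from this PR.

```python
import contextlib
import datetime


@contextlib.contextmanager
def temporary_group(dist, backend="gloo", timeout_s=30):
    """Create a short-lived process group and always destroy it afterwards.

    `dist` is expected to be the torch.distributed module (passed in so the
    sketch can also be exercised with a stand-in object).
    """
    # Synchronize all ranks first so they enter new_group at about the same time.
    dist.barrier()
    group = dist.new_group(
        backend=backend,
        timeout=datetime.timedelta(seconds=timeout_s),
    )
    try:
        yield group
    finally:
        # Release the group's resources once it is no longer needed.
        dist.destroy_process_group(group)


# Hypothetical usage inside a per-epoch training loop:
#
#   import torch.distributed as dist
#   for epoch in range(start_epoch, max_epoch):
#       with temporary_group(dist) as group_join:
#           train_one_epoch(..., group_join)  # train_one_epoch is a placeholder name
```

Using `try`/`finally` means the group is destroyed even if the epoch raises, so a failed epoch does not leave a group behind.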

@xingchensong xingchensong marked this pull request as draft November 27, 2023 05:50
@xingchensong xingchensong changed the title fix(train): Fix log fix(train): Fix log step Nov 27, 2023
@xingchensong xingchensong changed the title fix(train): Fix log step fix(train): Fix step & destroy group Nov 28, 2023
@xingchensong xingchensong marked this pull request as ready for review November 29, 2023 02:20
@xingchensong (Member Author) commented Nov 29, 2023

Previously, epoch 140 always triggered Resource temporarily unavailable. Now, with destroy_process_group in place, training continues past epoch 140, and the trend matches wenet v2.2.1.

image

@Mddct Mddct merged commit 1aadcf7 into main Nov 29, 2023
6 checks passed
@Mddct Mddct deleted the xcsong-fix-log branch November 29, 2023 02:32
@xingchensong (Member Author) commented

FYI:

6f510790de26825c43e171cb02934c8
