[XPU] fix the dataloader problem in RDMA env #54150
Conversation
When running multi-machine training with the Paddle DataLoader, an unexpected segmentation fault is raised in the DataLoader process, and the traceback ultimately points to a runtime error stating that the DataLoader workers exited unexpectedly. Similar problems have been discussed before and traced back to OpenCV misbehaving in a multiprocessing environment. See https://stackoverflow.com/questions/54013846/pytorch-dataloader-stucked-if-using-opencv-resize-method
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
✅ This PR's description meets the template requirements!
[Unresolved] Whether GPU has a similar problem. 王贤明 recalled that GPU also has this issue, but no conclusive evidence has been found.
Sorry to inform you that 9154033's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
The spawn start method raises a "Can't pickle local object" error in some situations.
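For illustration only (not code from this PR): the spawn start method pickles the target callable into the freshly started child process, so any locally defined function fails to pickle. A minimal sketch with the standard multiprocessing module; the `make_worker` helper is hypothetical:

```python
import multiprocessing as mp

def make_worker():
    # Nested (local) functions cannot be pickled, which 'spawn' requires
    # because the child process receives its target via pickle.
    def worker(x):
        print(x * 2)
    return worker

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=make_worker(), args=(21,))
    # Raises: AttributeError: Can't pickle local object 'make_worker.<locals>.worker'
    p.start()
    p.join()
```

The same call works with the fork start method, since forked children inherit the parent's memory instead of unpickling their target.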
LGTM
lgtm
Your PR has been merged into the Paddle repository. Please follow up on the subsequent test results.
)" This reverts commit 15c8752.
)" (PaddlePaddle#55150) This reverts commit 15c8752.
PR types
Bug fixes
PR changes
Others
Description
When running multi-machine training with the Paddle DataLoader, an unexpected segmentation fault is raised in the DataLoader process, and the traceback ultimately points to a runtime error stating that the DataLoader workers exited unexpectedly. Similar problems have been discussed before and traced back to OpenCV misbehaving in a multiprocessing environment.
See
https://stackoverflow.com/questions/54013846/pytorch-dataloader-stucked-if-using-opencv-resize-method
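Not the actual change in this PR, but a minimal sketch of the workaround commonly suggested in the linked thread: disable OpenCV's internal thread pool inside each DataLoader worker. The `ResizeDataset` and `disable_cv2_threads` names are hypothetical, and the sketch assumes `paddle.io.DataLoader` exposes a `worker_init_fn` hook:

```python
import cv2
import numpy as np
from paddle.io import DataLoader, Dataset

class ResizeDataset(Dataset):
    """Toy dataset that resizes random images with OpenCV in worker processes."""

    def __init__(self, n=100):
        self.n = n

    def __getitem__(self, idx):
        img = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
        # cv2.resize inside a DataLoader worker is where the hang/segfault is reported.
        return cv2.resize(img, (224, 224)).astype("float32")

    def __len__(self):
        return self.n

def disable_cv2_threads(worker_id):
    # Turn off OpenCV's own threading in each worker to avoid clashing
    # with the multiprocessing setup.
    cv2.setNumThreads(0)

if __name__ == "__main__":
    loader = DataLoader(
        ResizeDataset(),
        batch_size=8,
        num_workers=4,
        worker_init_fn=disable_cv2_threads,
    )
    for batch in loader:
        pass
```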