[XPU] fix the dataloader problem in RDMA env #54150

Merged 7 commits on Jun 28, 2023

@XiaociZhang XiaociZhang commented May 27, 2023

PR types

Bug fixes

PR changes

Others

Description

When running multi-machine training with the Paddle DataLoader, a DataLoader worker process can hit an unexpected segmentation fault. The traceback leads back to a runtime error reporting that DataLoader workers exited unexpectedly. Similar problems have been discussed elsewhere and traced to OpenCV misbehaving in a multiprocessing environment.
See
https://stackoverflow.com/questions/54013846/pytorch-dataloader-stucked-if-using-opencv-resize-method
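
For context, a workaround often suggested for this class of problem is to disable OpenCV's internal thread pool inside each worker process. The sketch below illustrates that idea only; it is not necessarily what this PR changes. The helper name `limit_opencv_threads` is hypothetical, and the import is guarded so the snippet also runs where OpenCV is not installed.

```python
def limit_opencv_threads(worker_id=0):
    """Turn off OpenCV's own threading in the calling process.

    Hypothetical helper, not taken from this PR. Returns True if
    OpenCV was found and configured, False if it is not installed.
    """
    try:
        import cv2  # imported lazily so the sketch runs without OpenCV
    except ImportError:
        return False
    # 0 makes OpenCV run its routines sequentially in this process.
    cv2.setNumThreads(0)
    return True


if __name__ == "__main__":
    print("OpenCV threads limited:", limit_opencv_threads())
```

Assuming the loader accepts a per-worker initialization callback (`paddle.io.DataLoader` has a `worker_init_fn` parameter), a function like this could be passed there so it runs once in every worker.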

paddle-bot bot commented May 27, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the "contributor" (External developers) and "status: proposed" labels on May 27, 2023

paddle-bot bot commented May 27, 2023

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@XiaociZhang
Contributor Author

[Unresolved] Whether GPU has a similar problem. 王贤明 recalled that GPU also has this issue, but no conclusive evidence was ever found.

@kuizhiqing kuizhiqing self-requested a review May 31, 2023 15:58

paddle-ci-bot bot commented Jun 7, 2023

Sorry to inform you that the CIs for 9154033 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

spawn method raise error 'Can't pickle local object' in some situations
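
The error named in this commit title typically appears because the spawn start method pickles the objects it sends to the child process, and a function defined inside another function cannot be pickled by reference. The following is a minimal standard-library sketch of that failure mode, not code from this PR; `make_local_collate` and `is_picklable` are hypothetical names.

```python
import pickle


def make_local_collate():
    # The nested function is a "local object": its qualified name
    # contains "<locals>", so pickle cannot look it up by reference.
    def collate(batch):
        return list(batch)

    return collate


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, TypeError, pickle.PicklingError):
        # The exception type for local objects differs between
        # Python versions, so several are caught here.
        return False


if __name__ == "__main__":
    print("builtin len picklable:", is_picklable(len))
    print("local collate picklable:", is_picklable(make_local_collate()))
```

With fork, the child inherits the parent's memory and no pickling of the target is needed, which is why the same code can work under fork and fail under spawn.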
@XiaociZhang XiaociZhang changed the title [kunlun] fix the dataloader problem in RDMA env [XPU] fix the dataloader problem in RDMA env Jun 26, 2023

@QingshuChen QingshuChen left a comment

LGTM

@kuizhiqing kuizhiqing left a comment

lgtm

@houj04 houj04 merged commit 15c8752 into PaddlePaddle:develop Jun 28, 2023

paddle-bot bot commented Jun 28, 2023

Your PR has been merged into the repository. An official integration test will be conducted later. Stay tuned.

@XiaociZhang XiaociZhang deleted the dataloader branch June 28, 2023 03:22
XiaociZhang added a commit to XiaociZhang/Paddle that referenced this pull request Jul 4, 2023
QingshuChen pushed a commit that referenced this pull request Jul 6, 2023
cqulilujia pushed a commit to cqulilujia/Paddle that referenced this pull request Jul 24, 2023