[XPU] fix the dataloader problem in RDMA env #54150
Conversation
When running multi-machine training with the Paddle DataLoader, an unexpected segmentation fault is raised in the DataLoader process, and the traceback ultimately points to a runtime error stating that the DataLoader workers exited unexpectedly. Similar problems have been discussed before and traced back to OpenCV misbehaving in a multiprocessing environment. See https://stackoverflow.com/questions/54013846/pytorch-dataloader-stucked-if-using-opencv-resize-method
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
✅ This PR's description meets the template requirements!
[Unresolved] Whether GPU has a similar problem. 王贤明 recalled that GPU also has this issue, but no conclusive evidence has been found.
Sorry to inform you that 9154033's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
The spawn start method raises a "Can't pickle local object" error in some situations.
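For illustration only (not code from this PR): the spawn start method pickles the target callable into the freshly started child process, so any locally defined function fails to pickle. A minimal sketch with the standard multiprocessing module; the `make_worker` helper is hypothetical:

```python
import multiprocessing as mp

def make_worker():
    # Nested (local) functions cannot be pickled, which 'spawn' requires
    # because the child process receives its target via pickle.
    def worker(x):
        print(x * 2)
    return worker

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=make_worker(), args=(21,))
    # Raises: AttributeError: Can't pickle local object 'make_worker.<locals>.worker'
    p.start()
    p.join()
```

The same call works with the fork start method, since forked children inherit the parent's memory instead of unpickling their target.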
LGTM
lgtm
Your PR has been merged into the Paddle repository. Please follow up on the subsequent test results.
)" This reverts commit 15c8752.
)" (PaddlePaddle#55150) This reverts commit 15c8752.
PR types
Bug fixes
PR changes
Others
Description
When running multi-machine training with the Paddle DataLoader, an unexpected segmentation fault is raised in the DataLoader process, and the traceback ultimately points to a runtime error stating that the DataLoader workers exited unexpectedly. Similar problems have been discussed before and traced back to OpenCV misbehaving in a multiprocessing environment.
See
https://stackoverflow.com/questions/54013846/pytorch-dataloader-stucked-if-using-opencv-resize-method
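Not the actual change in this PR, but a minimal sketch of the workaround commonly suggested in the linked thread: disable OpenCV's internal thread pool inside each DataLoader worker. The `ResizeDataset` and `disable_cv2_threads` names are hypothetical, and the sketch assumes `paddle.io.DataLoader` exposes a `worker_init_fn` hook:

```python
import cv2
import numpy as np
from paddle.io import DataLoader, Dataset

class ResizeDataset(Dataset):
    """Toy dataset that resizes random images with OpenCV in worker processes."""

    def __init__(self, n=100):
        self.n = n

    def __getitem__(self, idx):
        img = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
        # cv2.resize inside a DataLoader worker is where the hang/segfault is reported.
        return cv2.resize(img, (224, 224)).astype("float32")

    def __len__(self):
        return self.n

def disable_cv2_threads(worker_id):
    # Turn off OpenCV's own threading in each worker to avoid clashing
    # with the multiprocessing setup.
    cv2.setNumThreads(0)

if __name__ == "__main__":
    loader = DataLoader(
        ResizeDataset(),
        batch_size=8,
        num_workers=4,
        worker_init_fn=disable_cv2_threads,
    )
    for batch in loader:
        pass
```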