[Auto Parallel] Compatible new comm library upgrade for XPUs. #63817
Your PR was submitted successfully. Thank you for contributing to this open-source project!
    if core.is_compiled_with_xpu():
        dev._dtype = DeviceType.XPU
    else:
        dev._dtype = DeviceType.GPU
    visible_devices = os.getenv("CUDA_VISIBLE_DEVICES")
elif 'XPU_VISIBLE_DEVICES' in os.environ:
    dev._dtype = DeviceType.XPU
We may also need to add a check for XPULINK_VISIBLE_DEVICES.
For PP8, the following must be set:
export CUDA_DEVICE_ORDER=OAM_ID
export XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6
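However, this maps rank 0 to dev2, which prevents the communication library from working correctly. @Thunderbrook investigated the problem; the code referenced below explains it.
The devices passed in here come from --xpus in the model launch script, i.e. 0,1,2,3,4,5,6,7: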
selected_dev_list = self.ctx.node.device.get_selected_devices( - https://github.com/PaddlePaddle/Paddle/blob/827f362/python/paddle/distributed/launch/context/args_envs.py#L138
Here device._labels is parsed from XPULINK_VISIBLE_DEVICES and is 2,3,0,1,5,4,7,6. get_selected_devices therefore also returns 2,3,0,1,5,4,7,6, so rank 0 is dev2.
- https://github.com/PaddlePaddle/Paddle/blob/827f362/python/paddle/distributed/launch/context/node.py#L27
- https://github.com/PaddlePaddle/Paddle/blob/827f362/python/paddle/distributed/launch/context/device.py#L106
- https://github.com/PaddlePaddle/Paddle/blob/827f362/python/paddle/distributed/launch/context/device.py#L85
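Solutions:
Option 1: in addition to setting export XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6, also pass --xpus "2,3,0,1,5,4,7,6", so that rank 0 still maps to dev0.
Option 2 (recommended): remove --xpus from the training arguments.
In summary, intra-node PP8 requires:
- Set the environment variables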
export CUDA_DEVICE_ORDER=OAM_ID
export XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6
- Remove the following training argument:
--xpus
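The mapping described in this comment can be reproduced with a small standalone sketch. It assumes that the selection step looks up each requested id by its position in the visible-device list, and that an empty request yields positions 0..n-1; the function below is an illustrative stand-in written for this example, not a copy of the launcher code.

```python
def get_selected_devices(labels, devices=""):
    """Map requested device ids to positions in the visible-device list.

    Assumption: with no request, ranks take positions 0..n-1 in order;
    otherwise each requested id is looked up by its position in `labels`.
    """
    if not devices:
        return [str(i) for i in range(len(labels))]
    requested = [d.strip() for d in devices.split(",")]
    return [str(labels.index(d)) for d in requested]


# Labels as parsed from XPULINK_VISIBLE_DEVICES=2,3,0,1,5,4,7,6
labels = "2,3,0,1,5,4,7,6".split(",")

# Problem case: --xpus 0,1,2,3,4,5,6,7 sends rank 0 to position 2.
problem = get_selected_devices(labels, "0,1,2,3,4,5,6,7")

# Option 1: --xpus mirrors the environment variable.
option1 = get_selected_devices(labels, "2,3,0,1,5,4,7,6")

# Option 2: --xpus is omitted entirely.
option2 = get_selected_devices(labels)

print("problem:", problem)
print("option1:", option1)
print("option2:", option2)
```

Under these assumptions, both options give rank 0 the first position in the visible list, while the problem case gives it position 2.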
Option 2 is preferred.
LGTM
elif 'XPU_VISIBLE_DEVICES' in os.environ:
    dev._dtype = DeviceType.XPU
    visible_devices = os.getenv("XPU_VISIBLE_DEVICES")
elif 'CUDA_VISIBLE_DEVICES' in os.environ:
    if core.is_compiled_with_xpu():
        dev._dtype = DeviceType.XPU
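Could a comment be added here in a follow-up? Readers who do not know the background may be confused about why a CUDA environment variable is being read on XPU.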
OK, I will add it in the next PR.
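As a rough illustration of the branch under discussion, the sketch below shows one way the environment-variable precedence and the requested explanatory comment could look. The `DeviceType` enum, the `detect_device` helper, and the `compiled_with_xpu` parameter (standing in for `core.is_compiled_with_xpu()`) are hypothetical names introduced for this example, and the rationale in the code comment is an assumption drawn from the diff rather than a statement by the author.

```python
from enum import Enum


class DeviceType(Enum):
    # Hypothetical stand-in for the launcher's DeviceType.
    GPU = "gpu"
    XPU = "xpu"


def detect_device(environ, compiled_with_xpu):
    """Return (device_type, visible_labels), or (None, []) if nothing is set."""
    if "XPU_VISIBLE_DEVICES" in environ:
        dtype = DeviceType.XPU
        visible = environ["XPU_VISIBLE_DEVICES"]
    elif "CUDA_VISIBLE_DEVICES" in environ:
        # Assumption: an XPU build may still be launched with
        # CUDA_VISIBLE_DEVICES set, so the build flag, not the variable
        # name, decides the device type here.
        dtype = DeviceType.XPU if compiled_with_xpu else DeviceType.GPU
        visible = environ["CUDA_VISIBLE_DEVICES"]
    else:
        return None, []
    labels = [x.strip() for x in visible.split(",") if x.strip()]
    return dtype, labels
```

Passing the environment as a plain mapping keeps the sketch testable; in real use it could be called with `os.environ`.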
…XPUs. (PaddlePaddle#63817)" This reverts commit 551afbc.
Your email has been received and I will reply as soon as possible. Thank you!
PR Category
Communication Library
PR Types
New features
Description
Following PR #56604, adapt the new static-graph distributed communication library for XPU.