fleetrun launch in legacy mode #40568
Thanks for your contribution!
ctx.args.nnodes = len(hosts)
ctx.logger.info('args reset by env PADDLE_TRAINER_ENDPOINTS\n{}'.format(
    eps))

if 'DISTRIBUTED_TRAINER_ENDPOINTS' in ctx.envs:
Please consider staying compatible with PDC's ip:port scheme to avoid port conflicts (PDC uses global ports).
Compatibility mode is now enabled.
use specified devices
# python -m paddle.distributed.run --devices=0,1,2,3 train.py
# python -m paddle.distributed.launch --devices=0,1,2,3 train.py
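If this is configured through environment variables on the platform, does it follow the new-version logic or the old-version logic? Please check again whether there are any compatibility issues.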
The new version standardizes on --devices; otherwise the old version is used.
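For readers following this thread, here is a minimal sketch of one way such a fallback could be decided. It is not the code in this PR: `choose_mode` is a hypothetical helper, and treating any unrecognized option (for example the older-style `--gpus`) as a signal for the old launcher is an assumption made purely for illustration.

```python
import argparse


def choose_mode(argv):
    """Return 'new' or 'legacy' for a launch command line (illustrative only)."""
    parser = argparse.ArgumentParser(add_help=False)
    # Assumption: the new launcher only knows the unified --devices option.
    parser.add_argument("--devices", default=None)
    parser.add_argument("training_script")
    # parse_known_args collects options the parser does not recognize
    # instead of exiting with an error.
    _, unknown = parser.parse_known_args(argv)
    # Assumption: any unrecognized option means an old-style invocation.
    return "legacy" if unknown else "new"


print(choose_mode(["--devices=0,1,2,3", "train.py"]))
print(choose_mode(["--gpus=0,1", "train.py"]))
```

The first call uses only `--devices` and stays on the new path; the second passes an option the sketch does not know and falls back.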
if 'DISTRIBUTED_TRAINER_ENDPOINTS' in ctx.envs:
    ctx.master = ctx.envs['DISTRIBUTED_TRAINER_ENDPOINTS'].split(',')[0]
    eps = ctx.envs['DISTRIBUTED_TRAINER_ENDPOINTS'].split(',')
Apart from compatibility, is the DISTRIBUTED_TRAINER_ENDPOINTS environment variable still used in the new version?
DISTRIBUTED_TRAINER_ENDPOINTS is specific to PDC; compatibility mode is now enabled for it.
LGTM.
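As a rough illustration of the hunk discussed in this thread, the sketch below splits a comma-separated DISTRIBUTED_TRAINER_ENDPOINTS value into a master endpoint, the endpoint list, and the distinct hosts. `parse_trainer_endpoints` is a hypothetical helper, not a function from the PR; the plain dict standing in for `ctx.envs` and the sample addresses are assumptions.

```python
def parse_trainer_endpoints(envs):
    """Split DISTRIBUTED_TRAINER_ENDPOINTS into (master, endpoints, hosts)."""
    raw = envs.get("DISTRIBUTED_TRAINER_ENDPOINTS", "")
    # Drop empty entries so a trailing comma does not create a bogus endpoint.
    eps = [ep.strip() for ep in raw.split(",") if ep.strip()]
    if not eps:
        return None, [], []
    # As in the hunk above, the first endpoint is taken as the master.
    master = eps[0]
    hosts = []
    for ep in eps:
        # Take everything before the last ':' as the host part.
        host = ep.rsplit(":", 1)[0]
        if host not in hosts:
            hosts.append(host)
    return master, eps, hosts


# Sample value is made up for illustration.
envs = {"DISTRIBUTED_TRAINER_ENDPOINTS": "10.0.0.1:6070,10.0.0.1:6071,10.0.0.2:6070"}
master, eps, hosts = parse_trainer_endpoints(envs)
print(master, len(eps), len(hosts))  # prints: 10.0.0.1:6070 3 2
```

Counting distinct hosts rather than endpoints mirrors the `ctx.args.nnodes = len(hosts)` line shown earlier for PADDLE_TRAINER_ENDPOINTS.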
    default=None,
    help="the master/rendezvous server, ip:port")

base_group.add_argument(
Please add some more explanation for the newly added legacy argument.
OK.
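It would be worth stating here that legacy is only for internal debugging; external developers do not need to care about the difference between the two modes.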
LGTM
LGTM
Please update the API usage documentation and examples.
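To make the request for a documented legacy argument concrete, the following is a minimal, self-contained sketch of how such a flag could be declared with `argparse`. It is not the PR's actual definition: `build_parser`, `str2bool`, the group title, and the help wording are assumptions. The explicit string-to-bool converter is a deliberate choice here, because `type=bool` would turn the string `"false"` into `True`.

```python
import argparse


def str2bool(value):
    """Parse 'true'/'false' style strings into a bool."""
    v = str(value).strip().lower()
    if v in ("true", "1", "yes"):
        return True
    if v in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


def build_parser():
    """Build a parser with --master and a documented --legacy flag (sketch)."""
    parser = argparse.ArgumentParser("launch-sketch")
    base_group = parser.add_argument_group("Base Parameters")
    base_group.add_argument(
        "--master",
        type=str,
        default=None,
        help="the master/rendezvous server, ip:port")
    base_group.add_argument(
        "--legacy",
        type=str2bool,
        default=False,
        # Hypothetical help text reflecting the review comments above.
        help="use the legacy launch path; intended for internal debugging, "
        "external developers do not need to care about the two modes")
    return parser


args = build_parser().parse_args(["--legacy=true"])
print(args.legacy)  # prints: True
```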
LGTM for set_tests_properties(test_run PROPERTIES TIMEOUT 120)
PR types
New features
PR changes
Others
Describe
The new launch module introduces the following main features:
This PR merges it in a compatible way, so that:
If something goes wrong, users can fall back directly to the original launch behavior by adding --legacy=true.