fleetrun launch in legacy mode #40568

Merged on Mar 21, 2022 (9 commits)

@kuizhiqing (Member) commented Mar 15, 2022

PR types

New features

PR changes

Others

Describe

The new launch module introduces two main features:

  • a new architecture that is easy to extend and replaces the tangled workflow with a cleaner one
  • new arguments master and nnodes, which make distributed launch easier to use

This PR integrates the new module in a backward-compatible way:

  • cases that the new launch can handle are handled by the new launch
  • cases that the new launch can NOT handle fall back to the legacy launch
if ctx.is_legacy_mode():
    # legacy mode
else:
    # new launch

Legacy mode can also be activated directly by adding --legacy=true.
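The fallback rule described above can be sketched in a few lines. This is a minimal illustration, not the code from this PR: the `Context` class, its `legacy` and `unsupported` fields, and the `launch` function are hypothetical stand-ins. Only the `is_legacy_mode()` check and the `--legacy=true` override come from the description.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical stand-in for the launch context (not Paddle's class)."""
    legacy: bool = False       # assumed to mirror the --legacy=true flag
    unsupported: bool = False  # assumed: True when the new launch cannot handle the case

    def is_legacy_mode(self) -> bool:
        # Legacy mode applies when the user forces it or when the
        # new launch cannot handle the requested configuration.
        return self.legacy or self.unsupported


def launch(ctx: Context) -> str:
    """Return the name of the launcher that would run.

    A real launcher would start worker processes here; returning a
    label keeps the sketch self-contained.
    """
    if ctx.is_legacy_mode():
        return "legacy"
    return "new"


print(launch(Context()))                  # default configuration
print(launch(Context(legacy=True)))       # user passed --legacy=true
print(launch(Context(unsupported=True)))  # case the new launch cannot handle
```

Keeping the decision in a single predicate means callers never need to know why legacy mode was chosen, only that it was.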

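The new launch module mainly includes the following updates:

  • a new architecture with stronger extensibility, which makes secondary development simpler and avoids tangled handling branches
  • a new, easier-to-use way of launching, introduced by adding the master and nnodes arguments

This PR merges the two in a compatible way, so that:

  • cases covered by the new launch are handled by the new module
  • cases the new launch does not support are handed over to the original launch module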
if ctx.is_legacy_mode():
    # legacy mode
else:
    # new launch

If something unexpected happens, users can start with the original launch directly by adding --legacy=true.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

ctx.args.nnodes = len(hosts)
ctx.logger.info(
    'args reset by env PADDLE_TRAINER_ENDPOINTS\n{}'.format(eps))

if 'DISTRIBUTED_TRAINER_ENDPOINTS' in ctx.envs:
Contributor

Consider staying compatible with PDC's ip:port scheme to avoid port conflicts (PDC uses global ports).

Member Author

Compatibility mode has been enabled.


use specified devices
# python -m paddle.distributed.run --devices=0,1,2,3 train.py
# python -m paddle.distributed.launch --devices=0,1,2,3 train.py
Contributor

If the devices are configured through environment variables on the platform, does that follow the new launch logic or the legacy logic? Please check again whether there are compatibility issues.

Member Author

The new launch uses --devices uniformly; otherwise the legacy launch is used.

if 'DISTRIBUTED_TRAINER_ENDPOINTS' in ctx.envs:
    ctx.master = ctx.envs['DISTRIBUTED_TRAINER_ENDPOINTS'].split(',')[0]
    eps = ctx.envs['DISTRIBUTED_TRAINER_ENDPOINTS'].split(',')
Contributor

Apart from compatibility, is the DISTRIBUTED_TRAINER_ENDPOINTS environment variable still used in the new launch?

Member Author

DISTRIBUTED_TRAINER_ENDPOINTS is specific to PDC; compatibility mode is currently enabled for it.
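To make the thread above concrete, here is a rough sketch of how a master address and node count could be derived from an endpoint list in the environment. The function name `master_and_nnodes` and the sample addresses are hypothetical. Taking the first endpoint as the master follows the snippet under review, while counting distinct hosts to obtain the node count is an assumption, since the PR excerpt does not show how `hosts` is built.

```python
def master_and_nnodes(envs):
    """Derive (master, nnodes) from DISTRIBUTED_TRAINER_ENDPOINTS.

    Returns (None, None) when the variable is missing or empty.
    Hypothetical helper for illustration, not Paddle's implementation.
    """
    raw = envs.get("DISTRIBUTED_TRAINER_ENDPOINTS")
    if not raw:
        return None, None

    # Endpoints are a comma-separated list of ip:port entries.
    eps = [ep.strip() for ep in raw.split(",") if ep.strip()]
    if not eps:
        return None, None

    # The first endpoint acts as the master, as in the reviewed snippet.
    master = eps[0]

    # Assumption: one node per distinct host, regardless of port.
    hosts = {ep.rsplit(":", 1)[0] for ep in eps}
    return master, len(hosts)


sample_envs = {
    "DISTRIBUTED_TRAINER_ENDPOINTS": "10.0.0.1:6070,10.0.0.1:6071,10.0.0.2:6070",
}
print(master_and_nnodes(sample_envs))
print(master_and_nnodes({}))
```

Passing the environment in as a plain mapping, rather than reading `os.environ` directly, keeps the helper easy to test.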

@aoyulong (Contributor) left a comment

LGTM.

default=None,
help="the master/rendezvous server, ip:port")

base_group.add_argument(
Contributor

Please add more explanation for the newly added legacy argument.

Member Author

OK.

Contributor

It would be worth stating here that legacy is only for internal debugging, and that external developers do not need to care about the difference between the two modes.
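The request in this thread, a `--legacy` flag whose help text explains its purpose, might look roughly like the following with Python's standard `argparse`. This is a sketch under assumptions, not the parser from this PR: the group title, the `str2bool` helper, the `--nnodes` type and help text, and the exact `--legacy` wording are all illustrative. Only the `--master` help string and the argument names come from the conversation.

```python
import argparse


def str2bool(value):
    """Parse common textual booleans so that --legacy=true works."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    """Build a small parser with the arguments discussed in the PR."""
    parser = argparse.ArgumentParser(prog="launch-sketch")
    base_group = parser.add_argument_group("Base Parameters")  # assumed title
    base_group.add_argument(
        "--master",
        type=str,
        default=None,
        help="the master/rendezvous server, ip:port")
    base_group.add_argument(
        "--nnodes",
        type=int,  # assumed type for this sketch
        default=None,
        help="the number of nodes taking part in the job")
    base_group.add_argument(
        "--legacy",
        type=str2bool,
        default=False,
        help="use the legacy launch; intended for internal debugging, "
        "most users do not need to set it")
    return parser


args = build_parser().parse_args(["--legacy=true", "--master=127.0.0.1:8090"])
print(args.legacy, args.master, args.nnodes)
```

Spelling out in the help string that the flag is meant for debugging addresses the reviewers' concern without requiring users to understand both launch paths.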

@xymyeah (Contributor) left a comment

LGTM

@sandyhouse left a comment

LGTM

@XiaoguangHu01 (Contributor) left a comment

Please update the API usage documentation and examples.

@XieYunshen (Contributor) left a comment

LGTM for set_tests_properties(test_run PROPERTIES TIMEOUT 120)

@sandyhouse sandyhouse merged commit c54c60d into PaddlePaddle:develop Mar 21, 2022