fleetrun launch in legacy mode #40568

kuizhiqing · 2022-03-15T08:04:02Z

PR types

New features

PR changes

Others

Describe

The new launch module introduces the following main features:

a new architecture that is much easier to extend and that turns the previously messy workflow into a clean one
new arguments master and nnodes, which make distributed launch easier to use

This PR tries to stay compatible with existing usage:

cases that the new launch can handle are handled by the new launch
cases that the new launch can NOT handle are switched to the legacy launch

if ctx.is_legacy_mode():
    # legacy mode
else:
    # new launch

In any case, legacy mode can be activated directly by adding --legacy=true.
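
The fallback described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Context`, `LEGACY_ONLY_ARGS`, the placeholder option `--old_only_option`, and the rule inside `is_legacy_mode` are all hypothetical. Only the `is_legacy_mode()` name and the `--legacy=true` flag come from the description.

```python
from dataclasses import dataclass, field

# Hypothetical: option names that only the legacy launcher understands.
LEGACY_ONLY_ARGS = {"--old_only_option"}


@dataclass
class Context:
    """Hypothetical stand-in for the launch context."""
    argv: list = field(default_factory=list)

    def is_legacy_mode(self) -> bool:
        # An explicit --legacy=true always selects the legacy launch.
        if "--legacy=true" in self.argv:
            return True
        # Assumed rule, for illustration only: fall back when any
        # argument is one the new launch does not support.
        names = {arg.split("=", 1)[0] for arg in self.argv}
        return bool(names & LEGACY_ONLY_ARGS)


def launch(ctx: Context) -> str:
    if ctx.is_legacy_mode():
        return "legacy"  # legacy mode
    return "new"  # new launch
```

With this sketch, `launch(Context(["--nnodes=2", "train.py"]))` takes the new path, while adding `--legacy=true` or the placeholder legacy-only option takes the legacy path.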

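The new launch module mainly includes the following updates:

a new architecture with stronger extensibility, which makes secondary development simpler and avoids messy handling branches
a new, easier-to-use way of launching, introduced by adding the master and nnodes arguments

This PR merges the new module in a compatible way, so that:

cases covered by the new launch are handled by the new module
cases the new launch is not compatible with are handed over to the original launch module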

if ctx.is_legacy_mode():
    # legacy mode
else:
    # new launch

When the behavior cannot otherwise be controlled, users can add --legacy=true to start directly with the original launch method.

paddle-bot-old · 2022-03-15T08:04:21Z

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

xymyeah · 2022-03-17T03:12:56Z

python/paddle/distributed/launch/plugins/__init__.py

+        ctx.args.nnodes = len(hosts)
+        ctx.logger.info('args reset by env PADDLE_TRAINER_ENDPOINTS\n{}'.format(
+            eps))
+
    if 'DISTRIBUTED_TRAINER_ENDPOINTS' in ctx.envs:


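Please consider staying compatible with pdc's ip:port style to prevent port conflicts (pdc uses global ports).

Compatibility mode is already enabled.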
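
The endpoint handling discussed in this thread can be illustrated with a small sketch. The diff only shows that `nnodes` is set to `len(hosts)` and that `eps` is logged. How `hosts` is derived is not visible here, so the helper below (a hypothetical `nodes_from_endpoints`) assumes it is the list of distinct IPs in the comma-separated `ip:port` list.

```python
def nodes_from_endpoints(endpoints: str):
    """Split 'ip:port,ip:port,...' and count the distinct hosts."""
    eps = [ep.strip() for ep in endpoints.split(",") if ep.strip()]
    hosts = []
    for ep in eps:
        host = ep.rsplit(":", 1)[0]  # drop the port part
        if host not in hosts:  # keep first-seen order
            hosts.append(host)
    return eps, hosts, len(hosts)


# Example values are made up for illustration.
envs = {"PADDLE_TRAINER_ENDPOINTS": "10.0.0.1:6070,10.0.0.1:6071,10.0.0.2:6070"}
eps, hosts, nnodes = nodes_from_endpoints(envs["PADDLE_TRAINER_ENDPOINTS"])
print(nnodes)  # 2
```

Two endpoints on the same IP count as one node, which is why ports assigned globally by the platform matter for avoiding conflicts.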

xymyeah · 2022-03-17T03:14:30Z

python/paddle/distributed/launch/__init__.py


 use specified devices
-# python -m paddle.distributed.run --devices=0,1,2,3 train.py
+# python -m paddle.distributed.launch --devices=0,1,2,3 train.py


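If this is configured through environment variables on a platform, does it follow the new version's logic or the old version's? Please check again whether there are any compatibility issues.

The new version uses --devices uniformly; otherwise the old version is used.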

xymyeah · 2022-03-17T03:19:38Z

python/paddle/distributed/launch/plugins/__init__.py

    if 'DISTRIBUTED_TRAINER_ENDPOINTS' in ctx.envs:
-        ctx.master = ctx.envs['DISTRIBUTED_TRAINER_ENDPOINTS'].split(',')[0]
+        eps = ctx.envs['DISTRIBUTED_TRAINER_ENDPOINTS'].split(',')


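Apart from compatibility, is the DISTRIBUTED_TRAINER_ENDPOINTS environment variable still used in the new version?

DISTRIBUTED_TRAINER_ENDPOINTS is specific to pdc; compatibility mode is currently enabled for it.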

aoyulong

LGTM.

xymyeah · 2022-03-18T08:57:48Z

python/paddle/distributed/launch/context/args_envs.py

+        default=None,
+        help="the master/rendezvous server, ip:port")
+
+    base_group.add_argument(


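I suggest adding some more explanation for the newly added legacy argument.

It could be stated here that legacy is only for internal debugging, and that external developers do not need to care about the difference between the two modes.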
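
One way the suggestion above could look is sketched below with plain `argparse`. The actual definition of `--legacy` in this PR is not shown in the diff, so its type, default, and help wording here are assumptions, as are the `str2bool` and `build_parser` helpers. Only the `--master` help string is taken from the diff.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'true'/'false' style strings so that --legacy=true works."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="launch-sketch")
    base_group = parser.add_argument_group("Base Parameters")
    base_group.add_argument(
        "--master",
        type=str,
        default=None,
        help="the master/rendezvous server, ip:port")
    # Assumed definition: the help text spells out the intended audience.
    base_group.add_argument(
        "--legacy",
        type=str2bool,
        default=False,
        help="use the legacy launch; intended for internal debugging only, "
        "external developers do not need to care about the two modes")
    return parser
```

Parsing `["--legacy=true"]` with this parser yields `legacy=True`, and omitting the flag leaves it at `False`.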

xymyeah

LGTM

sandyhouse

LGTM

XiaoguangHu01

Please update the API usage documentation and examples.

XiaoguangHu01 · 2022-03-21T06:26:06Z

python/paddle/distributed/launch/context/args_envs.py

+        default=None,
+        help="the master/rendezvous server, ip:port")
+
+    base_group.add_argument(


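It could be stated here that legacy is only for internal debugging, and that external developers do not need to care about the difference between the two modes.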

XieYunshen

LGTM for set_tests_properties(test_run PROPERTIES TIMEOUT 120)

run to launch in legacy mode

1d5b3ae

add mlu support; optim status

c7da5b5

fix default device count

41c3854

kuizhiqing force-pushed the launch-new branch from 8f90c02 to 41c3854 Compare March 16, 2022 09:30

kuizhiqing added 3 commits March 16, 2022 16:01

job_id; env compatible

591c5f8

device nproc and start port

127114f

fix c_comm_init_op test

be9e1da

xymyeah suggested changes Mar 17, 2022


kuizhiqing added 3 commits March 17, 2022 06:49

fix setup

66086d8

fix setup

c7db6dd

fix setup api

6b71984

aoyulong approved these changes Mar 18, 2022


xymyeah approved these changes Mar 18, 2022


sandyhouse approved these changes Mar 21, 2022


fuyinno4 approved these changes Mar 21, 2022


XiaoguangHu01 approved these changes Mar 21, 2022


XieYunshen approved these changes Mar 21, 2022


sandyhouse merged commit c54c60d into PaddlePaddle:develop Mar 21, 2022
