Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New design for launch/run #40086

Merged
merged 15 commits into from
Mar 15, 2022
Merged

New design for launch/run #40086

merged 15 commits into from
Mar 15, 2022

Conversation

kuizhiqing
Copy link
Member

PR types

New features

PR changes

Others

Describe

This pr add new launch method for paddle, including elastic support.

More details in python/paddle/distributed/run/init.py

@paddle-bot-old
Copy link

paddle-bot-old bot commented Mar 2, 2022

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

python/paddle/distributed/run/__init__.py Show resolved Hide resolved
"--rank", type=int, default=-1, help="the peer rank")

base_group.add_argument(
"--log", type=str, default="INFO", help="log level. Default INFO")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think log_level or verbose is more meaningful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

"--np",
type=str,
default="1",
help="the number of peers, i.e. pod/node number")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In MPI, np means the total number of processes instead of number of nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think it is better to use another name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, replace with nnodes later

Copy link
Contributor

@aoyulong aoyulong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, everything looks good to me. But something may needs to be improved when porting auto parallel.

"--np",
type=str,
default="1",
help="the number of peers, i.e. pod/node number")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think it is better to use another name.

sys.exit(sigint)


class Controller(ControllerBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need both Controller and ControllerBase as the base class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ControllerBase refer to really basic workflow, Controller holds more API. Nothing but separation.

def replicas(self, replicas):
self._replicas = replicas

def set_replicas(self, np: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the replicas also mean the number of nodes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little diff, replicas refers to logical nodes number, since one nodes may deploy two or more replicas.



if __name__ == '__main__':
cli = PKVClient("http://localhost:8090")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PKVClient->KVClient?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never Mind, just for local testing.

return 0

def deploy(self):
for i in self._init_containers:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my understanding is right, the init_containers and conatiners are called by this launch module and users don't need to do the relaunch in their own scripts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly


assert self.client.wait_server_ready(timeout=600), 'server is not ready'

# 'aaaaaa' make suer main pod (master server) as rank 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make suer > make sure ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

help="seconds to wait before elastic perform training")
return parser.parse_args()

def _valide_env(self, key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valide -> valid ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return parser.parse_args()

def _valide_env(self, key):
if key in ['POD_IP']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

POD_IP(历史遗留)是k8s里的概念,用在这里其实并不是特别合适,是否要用一个新的名称,同时保留兼容POD_IP,后续逐步废弃POD_IP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,不过目前 role_maker 里面用了,目前需要兼容,后续可以考虑替换

def set_env_in_args(self):
env_args = {
'POD_IP': 'host',
'PADDLE_MASTER': 'master',
Copy link
Contributor

@xymyeah xymyeah Mar 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

相同名称的环境变名称字符串分散在多个py文件中,是否需要将相关环境变量统一定义?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

目前所有除device相关的环境变量都统一在这里了,且在这里定义 args 和 env 的对应关系

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以在check一下,"PADDLE_MASTER"�在其他地方也出现了�

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

出现的另一个 PADDLE_MASTER 并不是同一个,后者取决于 fleet api 里面的定义。launch/run 模块用到的环境变量在 context。

Copy link

@sandyhouse sandyhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# python -m paddle.distributed.run train.py

use specified devices
# python -m paddle.distributed.run --devices=0,1,2,3 train.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between "paddle.distributed.launch fleetrun" and "paddle.distributed.run"? Why not reuse launch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As some args may not compatible with launch, we use run in this version to avoid breaking change.
In the new design, run module is a self contained directory, which can replace launch by renaming in the stable version.

Copy link
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

在代码里注释说明一下TODO
正式发布时需要重点review下兼容性情况。

Copy link
Contributor

@XieYunshen XieYunshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for set_tests_properties(test_run PROPERTIES TIMEOUT 200)

@sandyhouse sandyhouse merged commit 67c6ddf into PaddlePaddle:develop Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants