[AutoParallel] Add paddle.distributed.shard layer api #57604
Conversation
… def dtensor_from_fn first edition
… add_shard_layer_api 20230901 pull latest code
… add_shard_layer_api
Your PR was submitted successfully. Thank you for your contribution to this open source project!
def shard_layer(
    layer: nn.Layer,
    process_mesh: dist.ProcessMesh,
Does shard_layer need to support the cross-mesh case? If not, does replicate_layer_params_and_buffers need to check whether the Tensor's ProcessMesh is valid?
In principle we shouldn't restrict whether it is cross-mesh. However, when a user does a cross-mesh shard_layer, they no longer need to call reshard, so the intended user-facing behavior here still needs discussion.
That said, there is no need to check the concrete mesh state here: shard_layer is a wrapper around shard_tensor, so the check is better placed in shard_tensor, which should raise an error when the mesh is invalid.
Got it!
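For context, here is a minimal usage sketch of the API under review. The MLP layer, the shard_fn body, and the placement-style dist.shard_tensor call are illustrative assumptions in the spirit of the documented API, not the exact code from this PR:

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

class MLP(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc1 = paddle.nn.Linear(8, 8)
        self.fc2 = paddle.nn.Linear(8, 8)

    def forward(self, x):
        return self.fc2(self.fc1(x))

def shard_fn(layer_name, layer, process_mesh):
    # Shard fc1's weight along mesh dim "x"; parameters not touched
    # here are converted to replicated dist tensors by shard_layer.
    if layer_name == "fc1":
        layer.weight = dist.shard_tensor(
            layer.weight, process_mesh, [dist.Shard(0)]
        )

layer = dist.shard_layer(MLP(), mesh, shard_fn)
```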
    layer: nn.Layer, mesh: dist.ProcessMesh
) -> None:
    for key, param in layer._parameters.items():
        if param is not None and not param.is_dist():
If a parameter is not a dist_tensor, this turns it into a replicated one. If we later add a convert-to-replicated step in the execution mechanism, will this still be needed?
If the downstream APIs all convert to replicated in place, this can be removed and the effect is the same.
The two approaches don't conflict, though: converting up front is logically cleaner, since an in-place conversion still implicitly modifies the user's input.
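A minimal sketch of the front-loaded conversion discussed above, assuming the placement-style dist.shard_tensor API. Only the helper's signature and the parameter loop come from the diff; the buffer branch and the add_parameter wiring are assumptions:

```python
import paddle.distributed as dist
from paddle import nn

def replicate_layer_params_and_buffers(
    layer: nn.Layer, mesh: dist.ProcessMesh
) -> None:
    # Convert plain (non-distributed) parameters and buffers to
    # replicated dist tensors up front, so downstream code sees
    # uniform inputs instead of mutating them in place later.
    for key, param in layer._parameters.items():
        if param is not None and not param.is_dist():
            layer.add_parameter(
                key, dist.shard_tensor(param, mesh, [dist.Replicate()])
            )
    for key, buffer in layer._buffers.items():
        if buffer is not None and not buffer.is_dist():
            layer.register_buffer(
                key, dist.shard_tensor(buffer, mesh, [dist.Replicate()])
            )
```

Registering the converted tensor back onto the layer, rather than rewriting the user's tensor in place, is what keeps the original input untouched, which is the point made in the reply above.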
class TestShardLayer(unittest.TestCase):
    def setUp(self):
        self.mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
This unit test isn't started via launch; can it still use two cards?
The mesh isn't actually used to shard anything here: the specs below are all None, so at runtime every tensor is replicated and the sharding step is skipped. That's why it can also run on a single card; it mainly exercises the end-to-end flow.
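A sketch of the single-card flow described here, assuming shard_layer converts untouched parameters to replicated by default; the Linear layer and the is_dist() assertion are illustrative, not the PR's exact test body:

```python
import unittest

import paddle
import paddle.distributed as dist

class TestShardLayer(unittest.TestCase):
    def setUp(self):
        self.mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

    def test_replicated_flow(self):
        # No shard_fn is given, so every parameter becomes a replicated
        # dist tensor; no real cross-card sharding happens, which is
        # why this also runs on a single card.
        layer = paddle.nn.Linear(8, 8)
        sharded = dist.shard_layer(layer, self.mesh)
        for param in sharded.parameters():
            self.assertTrue(param.is_dist())

if __name__ == "__main__":
    unittest.main()
```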
.. code-block:: python
A space is missing here.
Co-authored-by: zachary sun <70642955+sunzhongkai588@users.noreply.github.com>
LGTM
LGTM
lgtm for docs
LGTM
LGTM
…7604)

* def dtensor_from_fn first edition
* dtensor_from_fn first edition
* shard_layer api and utest(temporarily unavailable)
* shard_layer API and unit test preliminary complete
* complete the sample code modification according to ZhongKai's suggestion
* modify according to the review
* modify according to LiangGe's review
* Not approved yet, temporarily stored
* waiting for tensor to param
* Complete the modifications according to Weihang's review
* polish shard_layer api impl and doc
* add shard layer test
* rewrite unittest
* revert needless change
* polish doc
* add unittest for coverage
* add static branch and test
* polish en doc
* polish test details
* verify doc test demo
* Update python/paddle/distributed/auto_parallel/api.py

Co-authored-by: zachary sun <70642955+sunzhongkai588@users.noreply.github.com>

---------

Co-authored-by: yangxiaoyu14 <yangxiaoyu14@baidu.com>
Co-authored-by: zachary sun <70642955+sunzhongkai588@users.noreply.github.com>
PR types
New features
PR changes
APIs
Description
Pcard-73145
[AutoParallel] Add paddle.distributed.shard layer api
The cn doc: PaddlePaddle/docs#6201