add c10d dynamic loading mechanism and unit test #28068
Conversation
Thanks a lot for adding this!!! This will be very useful for other backends as well.
@ftian1, could you please also include the c10d-related headers in the PyTorch installation, so that an external ProcessGroup that depends on these headers can be built?
@mrshenli @mshiryaev Please help check the patch again.
@mshiryaev @jgong5 @mrshenli I updated the patch according to your comments. Thanks.
Test failures are real:
Oct 18 04:57:19 test_broadcast_coalesced_gloo_cuda (__main__.CommTest) ... THCudaCheck FAIL file=/var/lib/jenkins/workspace/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Oct 18 04:57:19 THCudaCheck FAIL file=/var/lib/jenkins/workspace/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Oct 18 04:57:19 ERROR:root:Caught exception:
Oct 18 04:57:19 Traceback (most recent call last):
Oct 18 04:57:19 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 133, in wrapper
Oct 18 04:57:19 fn(self)
Oct 18 04:57:19 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 35, in wrapper
Oct 18 04:57:19 return func(*args, **kwargs)
Oct 18 04:57:19 File "test_c10d.py", line 3198, in test_broadcast_coalesced_gloo_cuda
Oct 18 04:57:19 self._test_broadcast_coalesced(process_group, device)
Oct 18 04:57:19 File "test_c10d.py", line 3161, in _test_broadcast_coalesced
Oct 18 04:57:19 target = torch.arange(60, dtype=half, device=device).chunk(5)
Oct 18 04:57:19 File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 198, in _lazy_init
Oct 18 04:57:19 torch._C._cuda_init()
Oct 18 04:57:19 RuntimeError: cuda runtime error (3) : initialization error at /var/lib/jenkins/workspace/aten/src/THC/THCGeneral.cpp:54
I asked @gchanan offline regarding the above error. One potential cause is that the new cpp_extension code did not properly initialize cuda or triggered some error, which could lead to errors for subsequent tests.
Oct 18 04:57:19 test_collective_apis (__main__.BackendLoadingTest) ... ok
Oct 18 04:57:19 test_broadcast_coalesced_gloo_cpu (__main__.CommTest) ... ok
Oct 18 04:57:19 test_broadcast_coalesced_gloo_cuda (__main__.CommTest) ... THCudaCheck FAIL file=/var/lib/jenkins/workspace/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
To debug this, setting CUDA_LAUNCH_BLOCKING and calling cudaGetLastError() might help.
@yf225 do you know the appropriate way to initialize CUDA in cpp extensions?
@mrshenli @yf225 I will continue to look into it, but if you have suggestions I would appreciate them.
@ftian1 It might be worthwhile to look at the docs in torch/utils/cpp_extension.py (lines 571 to 650 in be3808d) and test/test_cpp_extensions.py regarding CUDA-related compilation.
@mrshenli @yf225 The root cause is that the classes in test_c10d.py derive from MultiProcessTestCase and invoke self._fork_processes() to start new processes. According to search results on the web, we have to spawn rather than fork the new processes to avoid the cuda_init() error. The awkward thing is that _spawn_processes() only works with Python 3; Python 2 asserts in _spawn_processes() in test/common_distributed.py. Could you give some comments on this? Shall we change all fork() calls to spawn() and only allow Python 3 to trigger the c10d test by adding it to the Python 2 blacklist?
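To illustrate the failure mode, here is a minimal sketch (not the actual test harness) of why forked workers hit the CUDA initialization error while spawned workers do not; the worker function and process count are illustrative assumptions:

```python
import torch
import torch.multiprocessing as mp

def worker(rank):
    # In a *forked* child whose parent already touched the CUDA runtime,
    # this init raises "cuda runtime error (3) : initialization error".
    torch.cuda.init()
    print(f"rank {rank}: CUDA initialized")

if __name__ == "__main__":
    # mp.spawn() re-executes the interpreter for each child, so every
    # worker initializes CUDA from a clean state; os.fork() would instead
    # inherit the parent's CUDA state, which the driver does not support.
    mp.spawn(worker, nprocs=2)
```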
@ftian1 Would it be possible to delay that call? We'll keep Python 2 support around until we have officially deprecated it on January 1st (see #23795).
@ftian1 Could you please also add a new init_method that explicitly allows the backend to discover rank and world_size internally inside the custom backend, similar to what the MPI backend does? I propose adding a new "auto" init_method that allows this, as I am doing in ddkalamk@84fddd8.
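For context, a hedged sketch of what the proposed call might look like; the "auto" value comes from the linked commit and is not an API this PR adds, and the backend name is hypothetical:

```python
import torch.distributed as dist

# rank and world_size are intentionally omitted; the proposal is that the
# custom backend derives them internally (e.g., from its own launcher),
# the way the MPI backend already does.
dist.init_process_group(backend="mybackend", init_method="auto")
```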
💊 CircleCI build failures summary and remediations
As of commit afe3c0b (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following build failure does not appear to be due to an upstream breakage: pytorch_linux_xenial_py3_6_gcc5_4_build (1/1), step "Build" (full log | pattern match details), confirmed not flaky by 2 failures.
Force-pushed 420758c to 4c8c67f
@mrshenli Thanks to @pietern's suggestions, I modified the code so that the delayed "import cpp_extension" happens in a separate test process. It works locally. The latest pre-CI failures, such as "/temp/test_c10d file existed" or "ninja not exist", look like test environment issues rather than bugs in my code. Could you please help review again? Sorry for the long PR thread...
Given the following failure, it seems that the load has to be called in the parent process; otherwise all child processes will try to create the same file. If the problem is spawn vs fork, could you please try moving the …
I wonder if the above failure is something that we need to fix. Do we actually need some uuid/temp dir/name for extension loading? Otherwise, it seems it would fail if multiple processes try to load the same extension?
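As a minimal sketch of the suggested workaround, assuming the test builds the extension with torch.utils.cpp_extension.load (paths and names here are illustrative): compile once in the parent process, before any workers start, so the children never race on the shared build directory.

```python
import torch.utils.cpp_extension

# Build (or reuse the cached build of) the extension exactly once, in the
# parent process, before forking/spawning test workers. Children then only
# import the already-built module instead of invoking the compiler.
module = torch.utils.cpp_extension.load(
    name="cpp_c10d_extension",
    sources=["test/cpp_extensions/cpp_c10d_extension.cpp"],
    verbose=True,
)
```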
Force-pushed 3aa4b39 to 8324090
Force-pushed 8324090 to 17ef172
@mrshenli Yes, I think we should use temp names for extension loading. I will file an issue for it after reproducing it with a smaller example.
@pritamdamania87 Can you review and approve this, or provide clear guidance on what is needed to have it approved?
Hello @VitalyFedyunin, would you please help review the PR?
I will pass it to @mrshenli as he is most experienced in distributed.
I apologize that I dropped the ball on this, and sorry about the long delay on this PR. Let's expedite the landing process for this.
The current version looks great to me. I have three main comments:
- Ideally, we should have a well-designed and stable PG interface before landing this PR. But as I am not sure when that will come, let's not block landing this PR on it. If we need to change the PG interface in the future, we will modify this as well. However, if your extension lives outside the PyTorch repo, there might be conflicts. How bad would it be if we broke your extension? @ftian1
Also, @agolynski please drop notes if you have comments on this.
- Looks like the following will be THE interface for the PG constructor. This might not be sufficient for future PG implementations if they require additional configurations, but we should be able to address that by adding something like backend_options later, when the requirement emerges. For now, is there any reason for adding the group_name here? NCCL and Gloo do not require a name but do require a timeout. Shall we at least pass the timeout to the PG as well, so that implementations can respect it if they would like to?
```python
pg = getattr(Backend, backend.upper())(
    prefix_store,
    rank,
    world_size,
    group_name)
```
- Looks like the test requires Ninja. Could you please explain why we need that?
Hey @yf225, do we have ninja installed in all our CI test envs?
When manually importing this backend and invoking :func:`torch.distributed.init_process_group` with the corresponding backend name, the `torch.distributed` package runs on the new backend.
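For illustration, a minimal sketch of that flow; the module and backend names below are hypothetical:

```python
import torch.distributed as dist
import my_c10d_backend  # hypothetical extension; importing it registers the backend

dist.init_process_group(
    backend="my_backend",  # the name the extension registered
    init_method="tcp://127.0.0.1:23456",
    rank=0,
    world_size=1,
)
```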
@agolynski since we are opening this up, we need some text clearly describing what APIs a 3rd-party backend should implement. Could you please prepare a PR for that after we land this one?
sure.
```cpp
const std::string& groupName);
```
```cpp
static void ProcessGroupTestConstructor() __attribute__((constructor)) {
  py::object module = py::module::import("torch.distributed");
```
Who is holding the GIL for these functions?
According to the pybind11 docs, when calling a C++ function from Python, the GIL is always held.
test/test_distributed.py (Outdated)
```
@@ -30,6 +31,12 @@

skipIfNoTorchVision = unittest.skipIf(not HAS_TORCHVISION, "no torchvision")

CPP_EXTENSIONS_WARNING = """
Ninja (https://ninja-build.org) must be available to run C++ extensions tests,
```
Curious, why do we need ninja for this test?
It's because the torch C++ extension build needs Ninja. It's copied from test/run_test.py (you can see the same thing in run_test.py).
```cpp
  return std::make_shared<ProcessGroupTest::WorkTest>();
}

std::shared_ptr<ProcessGroup::Work> ProcessGroupTest::allreduce_coalesced(
```
@agolynski do you have plans to make this allreduce_base as well? Let's not further delay this PR. Shall we land it now, and when you add the change and doc for allreduce_base, could you please modify this accordingly if necessary?
I will add allreduce_base in the refreshed PR.
What is allreduce_base, given that allreduce already expects just a single tensor?
I will leave it to @agolynski to comment on the plan for the allreduce/allgather_base API. :)
```python
    prefix_store,
    rank,
    world_size,
    group_name)
```
Does this mean all 3rd-party backends must use the same signature? What about timeout? And why do we need to pass in group_name here?
@mrshenli I understand your concern. In my first local version I added timeout and no group_name, but I later removed it because timeout is not necessary for our c10d backend.
I will add timeout back for all 3rd-party backends.
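Based on this exchange, a hedged sketch of what the registered creator could look like once timeout is passed through and group_name is dropped; the class and backend names are hypothetical, and the exact signature is whatever the PR finally lands:

```python
from datetime import timedelta
import torch.distributed as dist
from my_extension import MyProcessGroup  # hypothetical C++ extension binding

def create_my_process_group(prefix_store, rank, world_size,
                            timeout=timedelta(minutes=30)):
    # Construct the ProcessGroup subclass; timeout is forwarded so the
    # implementation can respect it, as NCCL and Gloo do.
    return MyProcessGroup(prefix_store, rank, world_size, timeout)

dist.Backend.register_backend("my_backend", create_my_process_group)
```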
Hey @agolynski, could you please check whether the API proposed in this PR looks OK to you? Thanks!
Force-pushed 8de7afb to 93a3692
@mrshenli @agolynski I updated the PR according to your comments; please let me know if it needs further updates. Many thanks for your patience.
Force-pushed 93a3692 to a21e6e2
Thanks a lot for putting this together! Overall, it LGTM! I left some minor comments on documentation.
@agolynski will comment on the c10d API.
```
@@ -395,6 +395,24 @@

.. autofunction:: all_gather_multigpu

Third-party backends
--------------------
```
Shall we mark this as experimental for now? You can add something like:
```rst
.. warning::
    The third-party backend API is experimental and subject to change.
```
will fix it
```python
    return value

@classmethod
def register_backend(cls, name, func):
    """Registers a new backend.
```
Does it show properly in the doc? Could you please try the following and paste the screenshot here? Thanks!
```sh
cd docs/
pip install -r requirements.txt
make html
```
You can check the result by using this gist to serve your docs and paste a screenshot in this PR.
@mrshenli, make html fails with the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'katex': 'katex'
The lucky thing is that distributed.html has been generated. I pasted a screenshot for your reference:
https://user-images.githubusercontent.com/16394660/76596980-70036280-653a-11ea-8d57-f36ee54894c2.png
```python
Arguments:
    name (str): Backend name matching with the one in `init_process_group()`.
    func (function): Function handler that instantiates the backend.
```
As we expect the function to take certain arguments (prefix_store, rank, world_size, and timeout), can we explain those args here as well?
Ideally, we might want to make this function a first-class citizen with a clear signature and documentation. But since we can mark this feature as experimental, we should have some room to change it in the future when necessary.
will add comments here
Test failure is real:
Mar 10 01:34:51 Traceback (most recent call last):
Mar 10 01:34:51 File "test_determination.py", line 97, in test_torch_file
Mar 10 01:34:51 "test_determination",
Mar 10 01:34:51 AssertionError: Lists differ: ['distributed/test_distributed', 'test_cpp_[75 chars]ion'] != ['test_cpp_extensions_aot_ninja', 'test_cpp[43 chars]ion']
Mar 10 01:34:51
The original behavior of pytorch c10d only supports built-in backends, such as nccl/gloo/mpi. This patch extends the c10d capability to support 3rd-party communication libraries that derive from the ProcessGroup base class. The related RFC is pytorch#27955. This way, the user just needs to manually import the backend and specify the backend name when invoking torch.distributed.init_process_group(). The proposed logic will check whether the backend has been registered through torch.distributed.Backend.register_backend(). As for how to develop a new 3rd-party backend through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp.
Force-pushed a21e6e2 to afe3c0b
@mrshenli I refined the doc/code according to your comments. Please let me know if it needs further updates. Thanks.
May I request your time to make progress on this PR? As I mentioned earlier, this PR and #32361 are critical for our work with @srinivas212 and @dmudiger on distributed DLRM training on CPUs.
Sorry about the delay. @agolynski is working on the last pass of the review and if everything looks good, he will land this soon. Thanks for contributing!
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ftian1 I am trying out a third-party backend using a custom extension as implemented in this PR. However, new_group() creation does not send the sub-group's list of ranks to the registered method in the extension. Could you suggest how to get this? Here's the function call I am referring to: pytorch/torch/distributed/distributed_c10d.py, line 724 in b9f0277.
@esaliya Could you please refer to this example? https://github.com/intel/torch-ccl/blob/master/torch_ccl/csrc/init.cpp It's a backend we added for this PR.
Thanks @ftian1. I looked at it, but it too doesn't have a way to get the list of ranks. In the second line, the list of ranks (pytorch/torch/distributed/distributed_c10d.py, line 724 in b9f0277) is not passed through. Any thoughts on this?
OK, I understand. When I added the dynamic loading mechanism, I didn't take sub_group/group_ranks into account, so if your backend requires that capability, you would have to contribute a PR to add it.
Thanks for confirming this.
The original behavior of pytorch c10d only supports built-in c10d backends, such as nccl/gloo/mpi. This patch extends the c10d capability to support dynamically loading 3rd-party communication libraries that derive from the ProcessGroup base class. The related RFC is in: #27955
This way, the user just needs to specify a 3rd-party c10d backend name when invoking torch.distributed.init_process_group(). The proposed logic will try to load the corresponding c10d backend cpp extension automatically. As for how to develop a new 3rd-party c10d backend through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp.
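Putting the pieces together, a hedged end-to-end sketch of the flow this patch enables; the module and backend names are hypothetical, and the authoritative reference remains test/cpp_extensions/cpp_c10d_extension.cpp:

```python
import torch
import torch.distributed as dist

# Importing the compiled extension registers the backend, e.g. via
# dist.Backend.register_backend(...) in the extension's init code.
import torch_test_cpp_extension  # hypothetical module built from the cpp extension

# init_process_group now looks the backend name up in the registry
# instead of being limited to the nccl/gloo/mpi built-ins.
dist.init_process_group("test", rank=0, world_size=1,
                        init_method="tcp://127.0.0.1:23456")

# Collectives dispatch to the third-party ProcessGroup subclass.
dist.all_reduce(torch.zeros(4))
```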