-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Remove support for single process multi-GPU #4531
Comments
Yes! This is one of reasons why in DL land horovod got so popular. The single process story of TensorFlow was too complicated, and didn't easily translate to multinode either. Nowadays most multi gpu code I see works with horovod and torch.distributed. Both are in the multiprocess+single-gpu-per-process paradigm. This also naturally scales to multiple nodes without any code editting and provides the user (and dev) with a simpler multigpu programming model. |
I like the idea of decoupling data distribution from algorithm. But I insist that this is better done in c++ instead of high level libraries like dask, there are R and Julia packages that don't attract much attentions as others. I want least maintenance burden added to language bindings. Those thick bindings are much harder to maintain than current slightly bloated internal code. Also currently both Java and scala bindings are maintained by @CodingCat , we might need to join the development or at least being part of review process if language dependent external packages are used. |
@TimZaman I thought the way TF distributing subgraph is quite elegant. Seems I'm a little behind on this. :) |
Agree with @TimZaman. RAPIDS/cuML is also following a one process per GPU (OPG) paradigm and we feel it's much simpler to develop and maintain (no hard evidence but just from our initial development compared to single process multi-GPU). |
Pros:
Cons:
I would have wanted xgboost multi-GPU to work very well, using nvlink or whatever. But if that's not going to happen, then there's not much point in keeping the multi-GPU capability in xgboost. |
@trivialfis I don't see how to do this. Consider that HostDeviceVector is everywhere through our API, the purpose of this RFC would be to remove the complexity of multi-GPU data partitioning from data structures like this. I look at this as a question of scope. What functionality are we as the developers with limited resources prepared to provide. It's very tempting to try to do everything but I think many of the features we currently for GPU have are not implemented well enough for my liking and the quality of the code base is not at the level I want it to be. You are correct that we will not provide out-of-the-box functionality to use multiple GPUs on R and Julia but @pseudotensor is also right that the multi-GPU solution is not particularly compelling vs. single GPU. I would personally be satisfied if we produce a high quality building block that others could use to implement a distributed machine learning system. |
@RAMitchell One problem at a time.
Let's devide and conquer. :) |
I'd start with Design Doc and POC showing how to parallalize the algorithm without introducing too much overhead. Correct me if I'm wrong, but finding best splits requires At the end if it scales good enough it overweights downsides like extra 3rd party library. |
Communication throughput via NCCL is the same across processes as it is for the threaded version. Running the algorithm in parallel is essentially a question of starting a rabit tracker and connecting workers to it. We can certainly document this more to enable others to potentially incorporate this into their distributed framework. I think some kind of blog post explaining the distributed architecture would be helpful for a lot of people. |
I’m ok with this path - robustness wins over features. |
I think we have collected enough feedback without significant objections from customers. I will file a PR enabling deprecation warnings and updating documentation. We will aim to start removing code around September. |
@RAMitchell Is this line a bug? Line 615 in ea44417
|
I've never seen this error message and we even have distributed gpu tests now, so it must not be reaching this code block. |
The documentation has been updated and deprecation warnings will now be issued. I will leave this issue open until we start removing this feature in case of any further feedback. |
Looks like half the cuda code is dealing with multi-gpu. Does anyone have any idea on how to proceed? Just bite the bullet and forge ahead? :) Some of the DMatrix refactoring might be blocked by this. |
@rongou @RAMitchell Can we close this now? What we have to do is gradually making code clean up. |
I'm working on #4773 which should be bulk of the changes needed. Maybe we can close this after it goes in? |
This issue proposes removing support for single process multi-GPU in xgboost. What this means is that we would no longer support the 'n_gpus' parameter and the ability to train with multiple GPUs without some distributed framework.
We would focus on multi-GPU support through distributed frameworks such as Dask (see #4473) and Spark, where the framework would be expected to assign a worker process for each GPU independently.
The reasoning is that it would greatly simplify all GPU code in xgboost. Currently GPU code works by accepting an input DMatrix and then manually partitioning rows across each available GPU. This partitioning as well as managing multiple GPUs via threads imposes significant complexity on developers. The result is more code that is harder to read, more bug prone (we have many issues documenting this) and less flexible when developing new features. It is difficult to overstate how beneficial this would be for developers currently working on GPU code.
By doing this we are effectively deferring the task of partitioning data for a single machine with multiple GPUs onto the higher level distributed framework.
In the short term this will be a direct removal of an existing feature, but I believe in the medium term (1 year) it will result in a significantly better product.
We would provide a path for users to transition to new multi-GPU support. This will be better for users in the long run also as these distributed solutions are arbitrarily scalable. Between making the decision to deprecate this functionality and removing code we would wait at least three months.
This is still very much up for debate and feedback is welcome from developers and users, particularly those who currently use multi-GPU support as a part of their workflow.
The text was updated successfully, but these errors were encountered: