-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Performance][CUDA] Labor UVA optimization #5885
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
@BarclayII could you take a look? On machines with multiple GPUs where PCI-e bandwidth is highly contested, this PR should improve things by quite a lot. |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
0524e63
to
ffecadf
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
c3fa494
to
21f505e
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
eccb068
to
2f7ee29
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
|
@frozenbugs |
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
|
893614c
to
009be59
Compare
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
|
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
|
Let's get this PR merged first, and let's discuss next week to find a path to integrate it in graphbolt. @mfbalin |
I would also like to make a PR including the kappa feature I talked about in my presentation. It modifies only the random number generation logic and adds the kappa parameter to the sampler API. However, it won't be useful if #4341 or an alternative dynamic cache is not merged. With the addition of the kappa PR, the official LABOR sampling implementation living inside DGL will be complete. |
@caojy1998 FYI. We need to add benchmark for |
OK, got it. |
Is the lack of a benchmark script blocking the merge of this PR or are we waiting for another reviewer's approval? |
no, not blocking, let's run the CI, and I will merge it. |
Co-authored-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
Description
Labor Sampler can perform UVA sampling. However, since it has to access the indptr and indices arrays (especially indices) very often, it slows down by a lot compared to NeighborSampler slowdown. This PR focuses on optimizing the accesses to indptr and indices arrays in a way that doesn't bring any regressions to the pure GPU scenario whereas improving training throughput with fanout 10,10,10 by more than 10 percent on ogbn-products training with
use_uva=True
. With this change,ogbn-products
UVA sampling runtime for labor went from 18ms to 13ms on an A100 machine during multi-GPU training. On other multi-GPU machines with V100s, the sampling time is more than 2x faster.Also fixed a bug for the weighted case. Added some tests to this PR so that we can detect if things are working properly. Below are the profiles of the old and the new code for LABOR-0 UVA. The new version has the extra OneHopExtractorAlignedKernel that copies the indices array in an aligned manner if it is pinned prior to the main sampling kernel.
Checklist
Please feel free to remove inapplicable items for your PR.