🔨Work Item
IMPORTANT:
Project tracker: https://github.com/orgs/dmlc/projects/2
Description
As can be seen in the picture below, almost half of the GPU sampling runtime is spent in _convert_to_sampled_subgraph, which takes 3 ms every time it is called in the hetero sampling example. It should take around 0.1 ms with an optimized custom implementation. The main culprit is the use of torch.nonzero, which causes a CPU-GPU synchronization to read the size of the nonzero ids each time it is called.
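To make the reported behavior concrete, here is a minimal sketch of the pattern in question: recovering per-etype positions from a fused edge-type tensor with torch.nonzero, once per edge type. This is hypothetical illustration code, not the actual _convert_to_sampled_subgraph implementation; split_by_etype_with_nonzero and the tensor shapes are assumptions, and torch.cuda.set_sync_debug_mode is used only to surface the implicit synchronizations.

```python
import torch

def split_by_etype_with_nonzero(etype_ids, num_etypes):
    """Recover per-etype positions with one torch.nonzero call per etype (hypothetical)."""
    per_etype = {}
    for etype in range(num_etypes):
        mask = etype_ids == etype
        # torch.nonzero must report how many matches it found before the
        # output can be allocated, so each call copies that count from the
        # GPU to the CPU and stalls the host until the stream catches up.
        per_etype[etype] = torch.nonzero(mask, as_tuple=True)[0]
    return per_etype

if torch.cuda.is_available():
    # Ask PyTorch to warn on every implicit synchronization so the
    # per-etype cost becomes visible: one warning per edge type, per call.
    torch.cuda.set_sync_debug_mode("warn")
    etype_ids = torch.randint(0, 8, (1_000_000,), device="cuda")
    split_by_etype_with_nonzero(etype_ids, num_etypes=8)
```

With several edge types, this pattern pays a fixed kernel-launch plus synchronization cost once per etype, which is consistent with the overhead described above.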
The main culprit is the use of torch.nonzero, causing a CPU-GPU synchronization to read the size of the nonzero ids each time it is called.
@mfbalin Is the direct cause of the sync that the tensor data torch.nonzero operates on is not on the GPU side, while the sampled_csc is targeted for the GPU?
No, the operation happens on the GPU. The reason is that nonzero checks whether tensor elements are nonzero, but it is not known in advance how many of them will be nonzero. That count is needed on the CPU side to allocate the output tensor, so there is a GPU-CPU synchronization.
Calling nonzero once for each etype in the graph makes this issue much worse, since every call adds a fixed but really high overhead.
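One way to shrink that overhead, sketched below purely as an illustration (split_by_etype_single_sync is a hypothetical helper, not the fix that was ultimately adopted), is to pay at most one synchronization per call instead of one per etype: group the edge positions with a single argsort, compute all etype counts in one kernel, and copy only the resulting offsets to the CPU in a single transfer.

```python
import torch

def split_by_etype_single_sync(etype_ids, num_etypes):
    # One stable argsort groups edge positions by etype while preserving
    # their original relative order; everything here stays on the GPU.
    order = torch.argsort(etype_ids, stable=True)
    counts = torch.bincount(etype_ids, minlength=num_etypes)
    offsets = torch.cat([counts.new_zeros(1), counts.cumsum(0)])
    # A single device-to-host copy: the (num_etypes + 1) slice boundaries.
    offsets = offsets.tolist()
    return {
        etype: order[offsets[etype]:offsets[etype + 1]]
        for etype in range(num_etypes)
    }
```

The synchronization is not eliminated entirely, but its cost no longer grows with the number of edge types, which is the part that hurts in the hetero sampling example.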