🔨Work Item
IMPORTANT:
Project tracker: https://github.com/orgs/dmlc/projects/2
Description
As can be seen in the picture below, almost half of the GPU sampling runtime is spent in _convert_to_sampled_subgraph, which takes 3 ms every time it is called in the hetero sampling example. It should take around 0.1 ms with an optimized custom implementation. The main culprit is the use of torch.nonzero, which causes a CPU-GPU synchronization to read the size of the nonzero ids each time it is called.
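To make the reported behavior concrete, here is a minimal sketch of the pattern in question: recovering per-etype positions from a fused edge-type tensor with torch.nonzero, once per edge type. This is hypothetical illustration code, not the actual _convert_to_sampled_subgraph implementation; split_by_etype_with_nonzero and the tensor shapes are assumptions, and torch.cuda.set_sync_debug_mode is used only to surface the implicit synchronizations.

```python
import torch

def split_by_etype_with_nonzero(etype_ids, num_etypes):
    """Recover per-etype positions with one torch.nonzero call per etype (hypothetical)."""
    per_etype = {}
    for etype in range(num_etypes):
        mask = etype_ids == etype
        # torch.nonzero must report how many matches it found before the
        # output can be allocated, so each call copies that count from the
        # GPU to the CPU and stalls the host until the stream catches up.
        per_etype[etype] = torch.nonzero(mask, as_tuple=True)[0]
    return per_etype

if torch.cuda.is_available():
    # Ask PyTorch to warn on every implicit synchronization so the
    # per-etype cost becomes visible: one warning per edge type, per call.
    torch.cuda.set_sync_debug_mode("warn")
    etype_ids = torch.randint(0, 8, (1_000_000,), device="cuda")
    split_by_etype_with_nonzero(etype_ids, num_etypes=8)
```

With several edge types, this pattern pays a fixed kernel-launch plus synchronization cost once per etype, which is consistent with the overhead described above.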
The main culprit is the use of torch.nonzero, causing a CPU-GPU synchronization to read the size of the nonzero ids each time it is called.
@mfbalin Is the direct cause of the sync that the tensor data torch.nonzero operates on is not on the GPU side, while the sampled_csc is targeted for the GPU?
No, the operation happens on the GPU. The reason is that nonzero checks whether tensor elements are nonzero, but it is not known in advance how many of them will be nonzero. That count is needed on the CPU side to allocate the output tensor, so there is a GPU-CPU synchronization.
Calling nonzero once for each etype in the graph makes this issue much worse, since every call adds a fixed but really high overhead.
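One way to shrink that overhead, sketched below purely as an illustration (split_by_etype_single_sync is a hypothetical helper, not the fix that was ultimately adopted), is to pay at most one synchronization per call instead of one per etype: group the edge positions with a single argsort, compute all etype counts in one kernel, and copy only the resulting offsets to the CPU in a single transfer.

```python
import torch

def split_by_etype_single_sync(etype_ids, num_etypes):
    # One stable argsort groups edge positions by etype while preserving
    # their original relative order; everything here stays on the GPU.
    order = torch.argsort(etype_ids, stable=True)
    counts = torch.bincount(etype_ids, minlength=num_etypes)
    offsets = torch.cat([counts.new_zeros(1), counts.cumsum(0)])
    # A single device-to-host copy: the (num_etypes + 1) slice boundaries.
    offsets = offsets.tolist()
    return {
        etype: order[offsets[etype]:offsets[etype + 1]]
        for etype in range(num_etypes)
    }
```

The synchronization is not eliminated entirely, but its cost no longer grows with the number of edge types, which is the part that hurts in the hetero sampling example.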