🚀[FEA]: Distributed Training/Inference: handle scatter/gather better and more consistently #520
**Labels:** `? - Needs Triage` (Need team to review and classify), `distributed` (Distributed and model parallel tools), `enhancement` (New feature or request)
**Is this a new feature, an improvement, or a change to existing functionality?**

Improvement

**How would you describe the priority of this feature request?**

Low (would be nice)
**Please provide a clear description of the problem you would like to solve.**
Problems exist in model-parallel settings where not all ranks hold valid tensors, mainly around the gather and scatter routines.
**Scatter**

Scatter routines should communicate the `dtype`, and other meta-information like `requires_grad`, to not break training pipelines.
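One way to carry this meta-information is to send a small serialized header ahead of the tensor payload, so non-source ranks can allocate buffers with the right `dtype` and restore `requires_grad` without an extra communication round. The sketch below is torch-free; the function names and the wire format are assumptions, not part of any existing API:

```python
import json
import struct


def pack_scatter_header(dtype: str, shape, requires_grad: bool) -> bytes:
    """Serialize tensor meta-data into a length-prefixed byte header.

    The 4-byte big-endian length prefix lets the receiver know how many
    bytes of JSON meta-data to read before the tensor payload begins.
    """
    meta = json.dumps(
        {"dtype": dtype, "shape": list(shape), "requires_grad": requires_grad}
    ).encode()
    return struct.pack("!I", len(meta)) + meta


def unpack_scatter_header(buf: bytes) -> dict:
    """Recover the meta-data dict from a header produced by pack_scatter_header."""
    (n,) = struct.unpack("!I", buf[:4])
    return json.loads(buf[4:4 + n].decode())
```

A non-source rank receiving this header can then construct an empty tensor of the advertised `dtype` and shape, and re-attach `requires_grad` after the scatter completes.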
**Gather**

Gather routines implemented as a `torch.autograd.Function` currently return `None` on all participating ranks; it could be more informative to have an object carrying information about this `None` just being the null part of a distributed tensor which is currently valid on rank X.

**Potential Solution**
Introduce a `TensorPlaceholder` which carries meta-data on the ranks where the tensor is currently not valid and is more informative than just `None`.
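A minimal sketch of what such a `TensorPlaceholder` could look like. The class name comes from this issue, but every field and the helper method are assumptions; the `dtype` is stored as a string to keep the sketch torch-free:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorPlaceholder:
    """Stands in for a tensor on ranks where it is not materialized.

    Unlike a bare ``None``, it records where the real tensor lives and
    enough meta-data (shape, dtype, requires_grad) for downstream code
    to allocate matching buffers or reason about the distributed tensor
    without an extra communication round.
    """

    src_rank: int                 # rank currently holding the valid tensor
    shape: Tuple[int, ...]        # shape of the tensor on the valid rank
    dtype: str = "float32"        # string stand-in for a torch dtype
    requires_grad: bool = False   # preserved so autograd pipelines are not broken

    def is_valid_on(self, rank: int) -> bool:
        """True if `rank` holds the actual data rather than this placeholder."""
        return rank == self.src_rank
```

For example, after a gather with destination rank 0, a non-destination rank could receive `TensorPlaceholder(src_rank=0, shape=(4, 8), requires_grad=True)` instead of `None`, making it explicit that the data still exists and where it lives.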
**Describe any alternatives you have considered**

_No response_