Enable torch elastic training (torchrun) #1603
Merged

Commits (33 total; this view shows changes from 25 commits)
9527795  Add torch elastic task type
c0616a8  updated
kumare3 c9edd3d  updated
kumare3 153e0d6  Don't pass min_nodes, max_nodes but nnodes only
264b6c6  Add docstrings
1a68e44  Cleanup test
4e73fb4  Replace wrong occurences of mpi in torch-elastic plugin
da17096  Extend kfpytorch plugin README
2073b0a  Move elastic task into existing kf-pytorch plugin
40f8f46  Add Elastic config to plugin's __init__
90c966e  Set elastic config in pytorchjob proto
d9567b5  removed unnecessary model files and simplified codebase
kumare3 fd2dbaf  Configure rdzv endpoint
e6a74ba  Configure worker count also for elastic training
007ab27  Fix exception scope and handle non-rank-0 outputs
fg91 a5b792f  Remove todo item about handling exception scope, now done
fg91 252e9d9  Add note about c10d backend
fg91 daa387f  Let user set min and max replicas explicitly and not via nnodes
fg91 5664aae  Catch torch import error and configure for flytekitplugins-kfpytorch…
fg91 d45a23e  Add more tests
fg91 2864d3b  Update plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py
fg91 b991e90  Lint
fg91 f0d8f6e  Explicitly add flyteidl version to plugin
kumare3 16bc6e5  changing import
kumare3 77fceb0  removed dynamic execute
kumare3 c96fa30  Revert "removed dynamic execute"
kumare3 a31fb47  updated ignoreoutputs
kumare3 da8a94c  requirements rebuilt
kumare3 28ade30  Revert back to nnodes instead of min_replicas, max_replicas, replicas
fg91 27fcb29  Don't handle dynamic execution scope
fg91 00aeb3c  Amend test to new nnodes api
fg91 312a436  Merge branch 'master' into fabio/feat/torch-elastic-plugin-fix
kumare3 a68f8c4  updated types
Diff view
flytekitplugins/kfpytorch/__init__.py:

```diff
@@ -10,4 +10,4 @@
 PyTorch
 """

-from .task import PyTorch
+from .task import Elastic, PyTorch
```
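For orientation, here is a minimal sketch of how the newly exported `Elastic` config is declared on a task. The parameter names (`replicas`, `nproc_per_node`, `start_method`) mirror the test added in this PR; treat it as illustrative of the API at this point in the review, since the exact parameters were still under discussion in the thread below.

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


# Illustrative sketch: run 2 worker processes on a single replica using a
# torchrun-style elastic launch. Parameter names are taken from the test
# file in this PR, not from the final released API.
@task(task_config=Elastic(replicas=1, nproc_per_node=2, start_method="spawn"))
def train() -> str:
    # This body is executed by every elastic worker process.
    return "done"
```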
plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/models.py (deleted: 0 additions, 23 deletions)
New test file for the elastic task (73 additions):

```python
import os
from dataclasses import dataclass

import pytest
import torch
import torch.distributed as dist
from dataclasses_json import dataclass_json
from flytekitplugins.kfpytorch.task import Elastic

from flytekit import task, workflow


@dataclass_json
@dataclass
class Config:
    lr: float = 1e-5
    bs: int = 64
    name: str = "foo"


def dist_communicate() -> int:
    """Communicate between distributed workers."""
    rank = torch.distributed.get_rank()
    world_size = dist.get_world_size()
    tensor = torch.tensor([5], dtype=torch.int64) + 2 * rank + world_size
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    return tensor.item()


def train(config: Config) -> tuple[str, Config, torch.nn.Module, int]:
    """Mock training a model using torch-elastic for test purposes."""
    dist.init_process_group(backend="gloo")

    local_rank = os.environ["LOCAL_RANK"]

    out_model = torch.nn.Linear(1000, int(local_rank) + 1)
    config.name = "elastic-test"

    distributed_result = dist_communicate()

    return f"result from local rank {local_rank}", config, out_model, distributed_result


@pytest.mark.parametrize("start_method", ["spawn", "fork"])
def test_end_to_end(start_method: str) -> None:
    """Test that the workflow with elastic task runs end to end."""
    world_size = 2

    train_task = task(train, task_config=Elastic(replicas=1, nproc_per_node=world_size, start_method=start_method))

    @workflow
    def wf(config: Config = Config()) -> tuple[str, Config, torch.nn.Module, int]:
        return train_task(config=config)

    r, cfg, m, distributed_result = wf()
    assert "result from local rank 0" in r
    assert cfg.name == "elastic-test"
    assert m.in_features == 1000
    assert m.out_features == 1
    """
    The distributed result is calculated by the workers of the elastic train
    task by performing a `dist.all_reduce` operation. The correct result can
    only be obtained if the distributed process group is initialized correctly.
    """
    assert distributed_result == sum([5 + 2 * rank + world_size for rank in range(world_size)])


def test_bad_replica_config() -> None:
    """Test that bad replica config is caught."""

    with pytest.raises(ValueError):
        task(train, task_config=Elastic(replicas=1, min_replicas=2))
```
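As a quick sanity check of the final assertion in `test_end_to_end`: every worker contributes `5 + 2 * rank + world_size` to the `all_reduce` sum, so with two workers the reduced value should be 16.

```python
# rank 0 contributes 5 + 0 + 2 = 7, rank 1 contributes 5 + 2 + 2 = 9
world_size = 2
expected = sum(5 + 2 * rank + world_size for rank in range(world_size))
print(expected)  # 16
```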
Conversation

kumare3:
can we just get rid of replicas?
and just make num_workers = max_replicas?
is that not right?
is replicas supposed to be desired_replicas?
fg91:
Yes, we can get this behaviour by reverting daa387f.

I brought this up here because I was very unsure whether we should expose pytorch's `nnodes` to the user or Kubeflow's `minReplicas`, `maxReplicas`, `replicas`. Kubeflow unfortunately provides zero docs about this, only the code where they set defaults for `minReplicas` and `maxReplicas`.

They have two elastic examples. The first one is basically what you proposed and what we had before daa387f. In the second example, however, `maxReplicas` > `replicas`: they use HPA to scale up nodes based on the configured metrics.

But I now realize that `maxReplicas > replicas` only makes sense if you use HPA, otherwise there is no way that you get more than the initial replicas, do you agree?

That being said, using metrics/HPA to dynamically scale up nodes appears, from everything I can see in the torch docs, to be a Kubeflow design decision and does not come from `torchrun` itself. The pytorch docs only talk about membership changes in general since they don't take care of provisioning nodes at all anyway.

My personal opinion: being able to continue training if one node dies is a nice feature. But I personally wouldn't want to use e.g. CPU metrics + HPA to add a new distributed training worker: since HPA might scale nodes up and down all the time, the worker groups might end up getting restarted (on every membership change) far more often than necessary, and training code often needs a bit of time for setting things up, loading the model, ... before the actual training starts. So I'm not convinced of this feature.

To summarize: if I now understand correctly after digging into this again, we would also have to expose the metrics if we want to allow our users to set `max_replicas > replicas`. I suggest not doing this, at least in version 1, and reverting back to only exposing `nnodes`.

What do you think @kumare3?
kumare3:
@fg91 I seem to agree, but I was thinking: instead of `nnodes`, keep `min_replica` and `max_replica`? Why not? Though I will go with whatever you decide.

I actually am not really sure if HPA works with gloo correctly. It's not really easy to change the membership and keep working correctly. Let's reserve this for later and get the basic version out. If you make the change I can +1 tonight and merge. I have been testing on single node and it seems to work great.
fg91:
I have a slight preference for `nnodes=1` / `"1:2"` over `min_replicas`, `max_replicas` for the following reasons:

- A mix of torchrun's `nnodes` and Kubeflow's `replicas`, `minReplicas`, `maxReplicas` is something that neither kind of user knows.
- Many users use torchrun simply because either a project like Alpacca or a library like ignite assumes it, or because they want to do distributed training on a single node without the operator. It should be most simple for those users who don't think about min/max, and I feel that `nnodes=2` is simpler for them than `min_replicas=2`, `max_replicas=2`. (Plus, if they come from pytorch, they already know the syntax.)

If it's ok for you I will change back to `nnodes` and ping you again for review :)
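To make the `nnodes` syntax concrete, here is a small hypothetical helper (not the plugin's actual implementation) showing how a torchrun-style `nnodes` value such as `2` or `"1:2"` could be translated into Kubeflow's min/max replica counts:

```python
from typing import Tuple, Union


def parse_nnodes(nnodes: Union[int, str]) -> Tuple[int, int]:
    """Hypothetical mapping of a torchrun-style nnodes value to (min, max) replicas.

    torchrun accepts either a fixed node count (e.g. 2) or an elastic
    range "MIN:MAX" (e.g. "1:2"); this sketch mirrors that convention.
    """
    if isinstance(nnodes, int):
        return nnodes, nnodes
    if ":" in nnodes:
        low, high = nnodes.split(":")
        return int(low), int(high)
    return int(nnodes), int(nnodes)


assert parse_nnodes(2) == (2, 2)      # fixed-size group: min == max
assert parse_nnodes("1:2") == (1, 2)  # elastic: training survives losing one of two nodes
```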
5552413