NOTE: As of torch-1.7 and torchelastic-0.2.1, torchelastic will be bundled into the main pytorch docker image. torchelastic/examples will be available after the torch-1.7 release, since its base image will now be pytorch/pytorch.
- Torchelastic agent:
  - `run_id` available to workers as the `TORCHELASTIC_RUN_ID` environment variable
  - Allow `max_restarts=0`
  - Worker exit barrier added to torchelastic agent to protect against variances in worker finish times
  - Improvements to error handling and propagation from torchelastic agent
  - Enable fault handlers on worker processes to get torch C++ stack traces
- `torchelastic.distributed.launch` CLI:
  - New option `--role` to allow users to set worker role name
  - CLI options can now be set via environment variables (e.g. `PET_NNODES="1:2"`)
- Project:
  - Upgraded to Python 3.8
  - Tests moved to `test` directory within the respective modules
  - Use Pyre
- Deprecated:
  - pytorch/elastic Docker image
- Experimental:
  - Training Session Manager (TSM) with localhost scheduler
  - torchelastic.multiprocessing
- Separate infrastructure related work from the user script. DesignDoc
- Events API
- First release torchelastic v0.1.0rc1 (experimental)
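
A minimal sketch of how a worker script might use two of the 0.2.1 items listed above: reading the run id that the agent exposes as `TORCHELASTIC_RUN_ID`, and interpreting a value such as `PET_NNODES="1:2"`. The helper names `get_run_id` and `parse_min_max` are hypothetical and not part of torchelastic, and the sketch assumes the value is either a single node count or a `MIN:MAX` range.

```python
import os


def get_run_id(default="unknown"):
    """Return the run id exported to this worker, or a fallback.

    Hypothetical helper: it only reads the TORCHELASTIC_RUN_ID
    environment variable named in the release notes.
    """
    return os.environ.get("TORCHELASTIC_RUN_ID", default)


def parse_min_max(value):
    """Parse "N" or "MIN:MAX" into a (min, max) tuple of ints.

    Hypothetical helper; the "MIN:MAX" reading of a value like "1:2"
    is an assumption of this sketch.
    """
    parts = value.split(":")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected 'N' or 'MIN:MAX', got {value!r}")
    if not 0 < low <= high:
        raise ValueError(f"invalid range: min={low}, max={high}")
    return low, high


if __name__ == "__main__":
    # Outside a torchelastic worker the variable is unset, so the fallback is used.
    print("run id:", get_run_id())
    # Fall back to the example value from the release notes when PET_NNODES is unset.
    print("nnodes range:", parse_min_max(os.environ.get("PET_NNODES", "1:2")))
```

Tagging log lines or checkpoint paths with the run id is one way to keep output from different jobs apart; whether that fits depends on the training script.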