Roadmap
This document lists the general directions that the core team would like to see developed in PyTorch-Ignite.
We use GitHub Projects to define our goals: releases, particular milestones, etc.
- continue maintaining high-quality, well-tested and documented modules.
- provide distributed framework support via ignite.distributed: XLA (e.g. TPU), Horovod
- provide new higher-level API based on Engine to simplify the usage while keeping flexibility as a contrib module
- provide helper on data management via ignite.data: sampling, multi-dataloaders
- provide more integrations with other tools to simplify Machine/Deep Learning end-to-end applications.
- visibility and communications
- add typing to the whole package
- adapt the code and add mypy check
- merge the contrib module into the main library?
- Provide helper Docker images to quick-start a task
- https://hub.docker.com/orgs/pytorchignite
- XLA devices support via pytorch/xla
- Horovod
- Explore DDP + RPC
- Better support different types of parallelism: data, model, pipeline.
- All metrics work in distributed
- configurable distributed metrics reduce/gather methods
- Minor improvements:
- better support of sklearn metrics
- Classification metrics with micro/macro options
- Metrics for NLP: ROUGE, BLEU, METEOR, PPL
- Metrics for GANs: FID, PPL (#998)
See also related GSoC 2021 project idea description
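To make the micro/macro option concrete, here is a minimal pure-Python sketch of multiclass precision under both averaging modes. The function name `precision` and its `average` parameter are assumptions made for illustration and are not the ignite metrics API.

```python
from typing import Dict, Sequence


def precision(y_true: Sequence[int], y_pred: Sequence[int], average: str) -> float:
    """Multiclass precision with "micro" or "macro" averaging (illustrative only)."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0
    tp: Dict[int, int] = {c: 0 for c in classes}  # true positives per predicted class
    fp: Dict[int, int] = {c: 0 for c in classes}  # false positives per predicted class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1
    if average == "micro":
        # Pool the counts over all classes, then compute a single ratio.
        total_tp = sum(tp.values())
        total = total_tp + sum(fp.values())
        return total_tp / total if total else 0.0
    if average == "macro":
        # Compute the ratio per class, then take the unweighted mean.
        per_class = [
            tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in classes
        ]
        return sum(per_class) / len(per_class)
    raise ValueError(f"average must be 'micro' or 'macro', got {average!r}")
```

With an imbalanced input the two modes diverge: micro averaging is dominated by the frequent class, while macro averaging weights every class equally.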
- push-button contrib trainers with AMP, distributed etc
- automatic batch size via toma
See also related GSoC 2021 project idea description
The Engine in the 0.4.x versions contains several major bugs related to the way event triggering and counting are implemented. In particular, event filtering requires the state and its corresponding attributes to be available, which is not a nice design. Solving the following issues: https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22module%3A+engine%22 requires a major Engine redesign while keeping as much backward compatibility as possible.
The idea is to split Engine(Serializable) -> Engine(Serializable, EventsDriven), where EventsDriven is a class responsible for event registration, triggering, etc. Thus, Engine will only contain the logic to register the necessary events and to run the two loops.
Exposing run_one_epoch publicly would help users combine their custom outer loops with the Engine's inner loop.
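The split described above could look roughly like the following sketch. It is a toy model and not the actual redesign: the event names are plain strings, Serializable is left out, and the names `add_event_handler`, `fire_event` and `counters` are assumptions chosen for illustration. Only `Engine`, `EventsDriven` and `run_one_epoch` come from the text.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Iterable, List


class EventsDriven:
    """Hypothetical base class: owns event registration, counting and firing."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[..., None]]] = defaultdict(list)
        self.counters: DefaultDict[str, int] = defaultdict(int)

    def add_event_handler(self, event: str, handler: Callable[..., None]) -> None:
        self._handlers[event].append(handler)

    def fire_event(self, event: str) -> None:
        # Counting lives here, so it does not depend on the engine's state.
        self.counters[event] += 1
        for handler in self._handlers[event]:
            handler(self)


class Engine(EventsDriven):
    """Toy engine: only decides which events fire and how the two loops run."""

    def __init__(self, process_function: Callable[["Engine", Any], Any]) -> None:
        super().__init__()
        self.process_function = process_function
        self.output: Any = None

    def run_one_epoch(self, data: Iterable[Any]) -> None:
        # Inner loop, public so that it can be driven from a custom outer loop.
        self.fire_event("epoch_started")
        for batch in data:
            self.fire_event("iteration_started")
            self.output = self.process_function(self, batch)
            self.fire_event("iteration_completed")
        self.fire_event("epoch_completed")

    def run(self, data: Iterable[Any], max_epochs: int = 1) -> None:
        # Outer loop.
        self.fire_event("started")
        for _ in range(max_epochs):
            self.run_one_epoch(data)
        self.fire_event("completed")


# Usage: a custom outer loop built on top of run_one_epoch.
engine = Engine(lambda e, batch: batch * 2)
outputs: List[int] = []
engine.add_event_handler("iteration_completed", lambda e: outputs.append(e.output))
for _ in range(2):
    engine.run_one_epoch([1, 2, 3])
```

Keeping the counters inside EventsDriven is one possible way to let event filtering work without reading attributes from the engine state.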
Required here:
The tricky part is resuming from the stopped iteration when the epoch length differs from the data size or when the data is an iterable.
Details
Currently, the engine's behavior is somewhat unclear about when it restarts from the beginning and when it continues.
Currently
# (re)start from 0 to 5
engine.run(data, max_epochs=5)  # -> Engine run starting with max_epochs=5 => state.epoch=5
# continue from 5 to 7
engine.run(data, max_epochs=7)  # -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7
# error
engine.run(data, max_epochs=4)  # -> ValueError: Argument max_epochs should be larger than the start epoch
# restart from 0 to 7 (state.epoch == max_epochs == 7; this has to restart, as we always call evaluator.run(data) without any other instructions)
engine.run(data, max_epochs=7)  # -> Engine run starting with max_epochs=7 => state.epoch=7
# forced restart from 0 to 5
engine.state.max_epochs = None
engine.run(data, max_epochs=5)  # -> Engine run starting with max_epochs=5 => state.epoch=5
# forced restart from 0 to 9, instead of continuing from state.epoch=7
engine.state.max_epochs = None
engine.run(data, max_epochs=9)  # -> Engine run starting with max_epochs=9 => state.epoch=9
The proposal is to change this slightly, addressing the "error" case and the ugly engine.state.max_epochs = None workaround.
Proposed API
# SAME. (re)start from 0 to 5
engine.run(data, max_epochs=5)  # -> Engine run starting with max_epochs=5 => state.epoch=5
# SAME. continue from 5 to 7
engine.run(data, max_epochs=7)  # -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7
# As max_epochs=4 <= state.epoch=7 => restart
engine.run(data, max_epochs=4)  # -> Engine run starting with max_epochs=4 => state.epoch=4
# restart from 0 to 4
engine.run(data, max_epochs=4)  # -> Engine run starting with max_epochs=4 => state.epoch=4
# Now (not forced) restart from 0 to 3 (as max_epochs=3 <= state.epoch=4 => restart)
engine.run(data, max_epochs=3)  # -> Engine run starting with max_epochs=3 => state.epoch=3
# SOMETHING TO CHANGE HERE. Forced restart from 0 to 9, instead of continuing from state.epoch=3
engine.state.max_epochs = None  # maybe engine.reset() -> state.epoch=state.iteration=0, state.max_epochs=state.max_iters=None
engine.run(data, max_epochs=9)  # -> Engine run starting with max_epochs=9 => state.epoch=9
# In case of max_iters, we'll have to do:
engine.run(data, max_iters=100)  # -> Engine run starting with max_iters=100 => state.iteration=100
engine.state.max_iters = None
engine.run(data, max_iters=100)  # -> Engine run starting with max_iters=100 => state.iteration=100
# So there is no uniform API to restart the engine...
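The proposed rule can be modeled with a small toy class that only tracks state and returns the log message it would print. This is a sketch under assumptions, not the real Engine: `ToyEngine`, its `epoch_length` argument and the `reset()` method are hypothetical, with `reset()` following the "maybe, engine.reset()" idea above, and `max_iters` handling is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class State:
    epoch: int = 0
    iteration: int = 0
    max_epochs: Optional[int] = None
    max_iters: Optional[int] = None


class ToyEngine:
    """Toy model of the proposed restart rule; it does not process any data."""

    def __init__(self, epoch_length: int = 10) -> None:
        self.epoch_length = epoch_length
        self.state = State()

    def reset(self) -> None:
        # Hypothetical uniform "forced restart": clears epoch, iteration,
        # max_epochs and max_iters in one call.
        self.state = State()

    def run(self, max_epochs: int) -> str:
        if max_epochs <= self.state.epoch:
            # Proposed behavior: restart instead of raising ValueError.
            self.state = State()
        if self.state.epoch == 0:
            message = f"Engine run starting with max_epochs={max_epochs}"
        else:
            message = (
                f"Engine run resuming from iteration {self.state.iteration}, "
                f"epoch {self.state.epoch} until {max_epochs} epochs"
            )
        # Pretend that all remaining epochs ran to completion.
        self.state.max_epochs = max_epochs
        self.state.epoch = max_epochs
        self.state.iteration = max_epochs * self.epoch_length
        return message
```

In this model, calling `run(5)` and then `run(7)` resumes, a following `run(4)` restarts because 4 <= 7, and `reset()` followed by `run(9)` forces a restart where a plain `run(9)` would have resumed.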
- Fix issue #1521
- better and simpler coverage of multi-dataloader use cases, e.g. GAN, SSL, etc.
- Verify compatibility (that ignite is not blocking) when writing applications for Federated Learning
- Verify compatibility (that ignite is not blocking) when writing applications with the Distributed RPC framework
- More applications and success stories with PyTorch-Ignite
- Showcase via ClearML Ignite server:
- more experiments with Ignite from our users
PyTorch-Ignite presented to you with love by PyTorch community