[feat] Add sync_context and sync to nn.Metric #302
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #302      +/-   ##
==========================================
+ Coverage   96.35%   96.44%   +0.08%
==========================================
  Files          97       97
  Lines        3241     3259      +18
==========================================
+ Hits         3123     3143      +20
+ Misses        118      116       -2
Hello @tchaton! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-06-21 09:41:26 UTC
Seems to be constantly failing on PT 1.6.

Thanks. Resolved.

@tchaton any reason/justification why the tests now take an extra 5 min?

Great question. I am investigating. @SkafteNicki @justusschock Any idea? Best,
self,
dist_sync_fn: Optional[Callable] = None,
process_group: Optional[Any] = None,
should_sync: bool = True,
I'm confused to see should_sync=True|False.
If you set False, this method does nothing, so it's the same as not calling sync in the first place!
Then, if you set True but dist is not available, it will also do nothing, so it basically does not do what the user wants.
should_sync means "sync if possible" :) Modified the docstring to reflect this.
It means there are now two overlapping arguments: dist_sync_fn and should_sync.
You can do this: should_sync=False and dist_sync_fn=mean. What will happen now? Will it sync or not?
@PyTorchLightning/core-metrics be aware of these cases.
Yeah, I agree. I wonder if the main usage of should_sync is just in sync_context, and maybe we should just decide there if syncing is needed or not? Doing an if with a context manager is a bit harder and might justify a flag, but for a function, it should be easy for people to just avoid calling it?
@maximsch2 your argument is to keep the flag for the context manager but remove it from this function, correct?
I think that would be fine.
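For illustration, here is a minimal sketch of one way the two arguments could interact; this is an assumption about the intended semantics, not the PR's implementation. The idea: should_sync acts as the master switch, and dist_sync_fn only selects the reduction used once a sync actually happens.

```python
from typing import Callable, Optional

import torch


def _should_perform_sync(
    should_sync: bool,
    dist_sync_fn: Optional[Callable] = None,
    distributed_available: Optional[Callable[[], bool]] = None,
) -> bool:
    """Sketch of the decision logic: ``should_sync`` gates syncing entirely,
    so ``dist_sync_fn`` is ignored whenever it is False."""
    if not should_sync:
        return False
    if distributed_available is not None:
        return distributed_available()
    return torch.distributed.is_available() and torch.distributed.is_initialized()
```

Under this reading, should_sync=False combined with dist_sync_fn=mean would never sync, which is one possible answer to the question above.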
Late to the party here, but I think we can also imagine a future where models are huge and sharded (with FSDP) and metric states are similarly sharded. We are getting away with synchronizing everything on rank 0 for now, but long-term we might need metrics that won't be able to do that.
@tchaton I had a different idea when proposing sync: the Metric subclass should provide the implementation, not the base. The metric states are updated in-place instead of returned to the caller. So one could chain together calls to update and sync before finally calling compute to get the state.
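As a usage illustration of that chaining idea, a hedged sketch with a toy metric (the SumMetric class is made up for the example; sync() is the in-place method discussed in this PR):

```python
import torch
from torchmetrics import Metric


class SumMetric(Metric):
    """Toy metric used only to illustrate the update -> sync -> compute chain."""

    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, x: torch.Tensor) -> None:
        self.total += x.sum()

    def compute(self) -> torch.Tensor:
        return self.total


metric = SumMetric()
metric.update(torch.tensor([1.0, 2.0]))
metric.sync()              # reduce the state across processes, in place
result = metric.compute()  # compute from the (now possibly synced) state
```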
dist_sync_fn: Optional[Callable] = None,
process_group: Optional[Any] = None,
should_sync: bool = True,
distributed_available: Optional[Callable] = distributed_available,
n00b question: why does distributed_available need to be an argument?
Because torchmetrics does not know about any distributed platforms other than CUDA, for example TPUs or IPUs.
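For illustration, a hedged sketch of what passing a custom check might look like on TPU; the torch_xla helper and the way it is wired in are assumptions, not part of this PR:

```python
def tpu_distributed_available() -> bool:
    """Assumed helper: report whether more than one XLA (TPU) process is running."""
    try:
        import torch_xla.core.xla_model as xm  # optional dependency
        return xm.xrt_world_size() > 1
    except ImportError:
        return False


# hypothetical call, reusing the signature shown in the diff above:
# metric.sync(distributed_available=tpu_distributed_available)
```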
if cache and restore_cache:
    # if we synced, restore to cache so that we can continue to accumulate un-synced state
    for attr, val in cache.items():
        setattr(self, attr, val)
What's the use case for this? If we sync, we should assume all metrics are operating off the synced state and not accumulate local changes, right?
This was here already, just moved.
Added in dd1e744
cc: @SkafteNicki
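For context, a minimal sketch (not the actual implementation) of the cache-and-restore idea behind that block: stash the local states, sync for reading, then restore them so each process keeps accumulating its own un-synced contributions. The helper name and the state_names argument are made up for the example.

```python
from contextlib import contextmanager


@contextmanager
def sync_and_restore(metric, state_names):
    # keep a copy of the local (un-synced) states
    cache = {name: getattr(metric, name) for name in state_names}
    metric.sync()  # replace the states with the reduced values
    try:
        yield metric  # caller reads or checkpoints the synced state here
    finally:
        # restore the un-synced local states so accumulation can continue
        for name, value in cache.items():
            setattr(metric, name, value)
```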
)

for attr, reduction_fn in self._reductions.items():
    # pre-processing ops (stack or flatten for inputs)
-   if isinstance(output_dict[attr][0], Tensor):
+   if isinstance(output_dict[attr], Sequence) and isinstance(output_dict[attr][0], Tensor):
This check is not safe; we're seeing errors as a result:
if isinstance(output_dict[attr], Sequence) and isinstance(output_dict[attr][0], Tensor):
IndexError: list index out of range
Removed in #311
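For illustration, a hedged sketch of a safer guard that avoids the IndexError by checking for a non-empty sequence before indexing; the helper is hypothetical and not necessarily the fix applied in #311:

```python
from collections.abc import Sequence

from torch import Tensor


def is_nonempty_tensor_sequence(values) -> bool:
    """Return True only if ``values`` is a non-empty sequence whose first
    element is a Tensor, so ``values[0]`` is never evaluated on an empty list."""
    return (
        isinstance(values, Sequence)
        and len(values) > 0
        and isinstance(values[0], Tensor)
    )
```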
What does this PR do?
This PR adds a more generic _apply_sync function to the nn.Metric base class. Fixes #67

Reasoning:
A user might want to perform a reduction without performing compute.

Current use case:
In Lightning, we are enabling mid-epoch restarts. To do this, we need to save the synchronized states across processes on rank 0. In that scenario, the compute call is just overhead and should be skipped.
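As a hedged sketch of that use case (the sync_context argument names follow the diff above; the checkpointing helper itself is made up for the example):

```python
import torch
import torch.distributed as dist


def save_metric_states(metric, path: str = "metric_states.pt") -> None:
    """Sync metric states across processes and save them from rank 0,
    without ever calling compute()."""
    with metric.sync_context(should_sync=dist.is_available() and dist.is_initialized()):
        if not dist.is_initialized() or dist.get_rank() == 0:
            torch.save(metric.state_dict(), path)
    # once the context exits, the local un-synced states are restored
```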
Did you have fun?
Make sure you had fun coding 🙃