
[feat] Add sync_context and sync to nn.Metric #302

Merged
merged 49 commits into master Jun 21, 2021

Conversation


@tchaton tchaton commented Jun 17, 2021

What does this PR do?

This PR adds a more generic _apply_sync function to the nn.Metric base class.

Fixes: #67

Reasoning:
Users might want to perform a reduction without also performing compute.

Current use case:
In Lightning, we are enabling mid-epoch restarts.
To do this, we need to save the synchronized states across processes on rank 0.
The compute call is therefore pure overhead and should be skipped.

Did you have fun?

Make sure you had fun coding 🙃
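For context, the sync-then-restore pattern this PR introduces can be sketched without torch as follows. This is a toy stand-in, not the torchmetrics implementation: MiniMetric, the fake two-process gather, and the plain-sum reduction are all illustrative assumptions; only the method names mirror the PR.

```python
from contextlib import contextmanager
from typing import Callable, Dict

class MiniMetric:
    """Toy stand-in for torchmetrics.Metric illustrating sync/sync_context."""

    def __init__(self) -> None:
        self.x = 0  # metric state

    def update(self, value: int) -> None:
        self.x += value

    def sync(self, dist_sync_fn: Callable = sum) -> Dict[str, int]:
        # Cache the un-synced state, then replace it with the reduced value.
        cache = {"x": self.x}
        # In real DDP this would all_gather "x" from every process;
        # here we fake two processes holding the same local state.
        gathered = [self.x, self.x]
        self.x = dist_sync_fn(gathered)
        return cache

    def unsync(self, cache: Dict[str, int]) -> None:
        # Restore the cached local state so accumulation can continue.
        for attr, val in cache.items():
            setattr(self, attr, val)

    @contextmanager
    def sync_context(self, should_sync: bool = True, restore_cache: bool = True):
        cache = self.sync() if should_sync else None
        yield
        if cache is not None and restore_cache:
            self.unsync(cache)

m = MiniMetric()
m.update(3)
with m.sync_context():
    synced = m.x   # reduced across the two fake processes
local = m.x        # restored local state after the context exits
```

Inside the context the state is the world-reduced value (6 here), and after the context it is rolled back to the local value (3), which is what lets Lightning snapshot synced state on rank 0 and keep accumulating afterwards.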

@tchaton tchaton added this to the v0.5 milestone Jun 17, 2021
@tchaton tchaton self-assigned this Jun 17, 2021

codecov bot commented Jun 17, 2021

Codecov Report

Merging #302 (9872ddd) into master (f54ccca) will increase coverage by 0.08%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #302      +/-   ##
==========================================
+ Coverage   96.35%   96.44%   +0.08%     
==========================================
  Files          97       97              
  Lines        3241     3259      +18     
==========================================
+ Hits         3123     3143      +20     
+ Misses        118      116       -2     
Flag Coverage Δ
Linux 76.79% <50.00%> (-0.03%) ⬇️
Windows 76.79% <50.00%> (-0.03%) ⬇️
cpu 96.37% <100.00%> (+0.02%) ⬆️
gpu 96.40% <100.00%> (?)
macOS 96.37% <100.00%> (+0.02%) ⬆️
pytest 96.44% <100.00%> (+0.08%) ⬆️
python3.6 95.41% <100.00%> (+0.02%) ⬆️
python3.8 96.28% <100.00%> (-0.08%) ⬇️
python3.9 ?
torch1.3.1 95.41% <100.00%> (+0.02%) ⬆️
torch1.4.0 ?
torch1.9.0 96.28% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
torchmetrics/metric.py 95.51% <100.00%> (+0.27%) ⬆️
torchmetrics/functional/regression/spearman.py 97.77% <0.00%> (+4.44%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f54ccca...9872ddd. Read the comment docs.


pep8speaks commented Jun 17, 2021

Hello @tchaton! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-21 09:41:26 UTC

@SkafteNicki (Member)

Hi @tchaton,
This seems related to #67, so maybe it should be a public method?


Borda commented Jun 18, 2021

seems constantly failing on PT 1.6

E       Exception: 
E       
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/usr/share/miniconda3/envs/test/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
E           fn(i, *args)
E         File "/home/runner/work/metrics/metrics/tests/bases/test_ddp.py", line 146, in _test_state_dict_is_synced
E           metric(i)
E         File "/usr/share/miniconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
E           result = self.forward(*input, **kwargs)
E         File "/home/runner/work/metrics/metrics/torchmetrics/metric.py", line 189, in forward
E           self._forward_cache = self.compute()
E         File "/home/runner/work/metrics/metrics/torchmetrics/metric.py", line 326, in wrapped_func
E           self._computed = compute(*args, **kwargs)
E         File "/home/runner/work/metrics/metrics/tests/bases/test_ddp.py", line 139, in compute
E           return self.x / self.c
E       RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.

@Borda Borda enabled auto-merge (squash) June 18, 2021 13:33

tchaton commented Jun 18, 2021

> seems constantly failing on PT 1.6

Thanks. Resolved.
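For reference, the failure above is torch 1.6's deprecation of `/` on integer tensors. The distinction, shown with plain Python numbers since it is the same true-vs-floor division question (the commented torch lines are an assumed shape of the fix, not necessarily the exact change made in this PR):

```python
x, c = 9, 4
true_div = x / c    # true division, what a mean-style metric wants
floor_div = x // c  # floor division, what the error message suggests for ints
assert true_div == 2.25
assert floor_div == 2
# For integer *tensors*, the fix is to make the division explicit, e.g.:
#   return self.x.float() / self.c            # true division
#   return torch.true_divide(self.x, self.c)  # equivalent
```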


Borda commented Jun 18, 2021

@tchaton any reason/justification why the tests now take an extra 5+ minutes?


tchaton commented Jun 18, 2021

> @tchaton any reason/justification why the tests take more than an extra 5min?

Great question. I am investigating. @SkafteNicki @justusschock Any idea?

Best,
T.C

self,
dist_sync_fn: Optional[Callable] = None,
process_group: Optional[Any] = None,
should_sync: bool = True,
Contributor

I'm confused to see should_sync=True|False.

If you set False, this method does nothing, so it's the same as not calling sync in the first place!
And if you set True but dist is not available, it also does nothing, so it does not do what the user wants.

Contributor Author

should_sync means sync if possible :) Modified the docstring to reflect this.

Contributor

Now there are two overlapping arguments: dist_sync_fn and should_sync.

You can do this: should_sync=False and dist_sync_fn=mean.

What will happen now? Will it sync or not?
@PyTorchLightning/core-metrics be aware of these cases

Contributor

Yeah, I agree. I wonder if the main use of should_sync is inside sync_context, and maybe we should just decide there whether syncing is needed. Wrapping a context manager in an if is a bit awkward and might justify a flag, but for a plain function it is easy for callers to simply not call it.

Contributor

@maximsch2 your argument is to keep the flag for the context manager but remove it from this function, correct?

I think that would be fine.
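To make the debated semantics concrete, here is a toy model of the decision logic. The parameter names mirror the signature quoted above, but the two-process gather is faked and this is one plausible reading, not the merged implementation: should_sync=False short-circuits, so a supplied dist_sync_fn is simply ignored.

```python
def sync(state, dist_sync_fn=None, should_sync=True,
         distributed_available=lambda: True):
    """Toy model of the should_sync / dist_sync_fn interaction."""
    if not should_sync or not distributed_available():
        # No-op: even an explicitly supplied dist_sync_fn is ignored.
        return state
    fn = dist_sync_fn or sum
    gathered = [state, state]  # stand-in for all_gather across 2 processes
    return fn(gathered)

assert sync(3) == 6                                   # default: sum over "ranks"
assert sync(3, dist_sync_fn=max, should_sync=False) == 3  # should_sync wins
```

Under this reading the answer to "will it sync or not?" is: it will not, regardless of dist_sync_fn, which is exactly the overlap being flagged.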

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
@tchaton tchaton changed the title [feat] Add _apply_sync to nn.Metric [feat] Add sync_context and sync to nn.Metric Jun 21, 2021
@Borda Borda merged commit fc3333b into master Jun 21, 2021
@Borda Borda deleted the apply_sync_fn branch June 21, 2021 10:27
@maximsch2 (Contributor)

Late to the party here, but we can also imagine a future where models are huge and sharded (with FSDP) and metric states are similarly sharded. We are getting away with synchronizing everything on rank 0 for now, but long term we might need metrics that won't be able to do that.

@ananthsub ananthsub left a comment:

@tchaton I had a different idea when proposing sync: the Metric subclass should provide the implementation, not the base. The metric states are updated in-place instead of returned to the caller. So one could chain together calls to update and sync before finally calling compute to get the state.
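A rough sketch of the chained in-place API described here. ChainableMetric and the in-place world-size multiply are illustrative stand-ins (the multiply fakes an all-reduce), not torchmetrics code; only the update/sync/compute names come from the comment.

```python
class ChainableMetric:
    """Sketch of the alternative API: update and sync mutate state
    in place and return self, so calls chain, and compute reads the
    (possibly synced) state at the end."""

    def __init__(self) -> None:
        self.total = 0

    def update(self, value: int) -> "ChainableMetric":
        self.total += value
        return self

    def sync(self, world_size: int = 2) -> "ChainableMetric":
        # Stand-in for an in-place all-reduce across processes.
        self.total *= world_size
        return self

    def compute(self) -> int:
        return self.total

assert ChainableMetric().update(2).update(3).sync().compute() == 10
```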

dist_sync_fn: Optional[Callable] = None,
process_group: Optional[Any] = None,
should_sync: bool = True,
distributed_available: Optional[Callable] = distributed_available,
Contributor

n00b question: why does distributed_available need to be an argument?

Contributor

Because torchmetrics does not know about distributed platforms other than CUDA, for example TPUs or IPUs.

Comment on lines +303 to +307
if cache and restore_cache:
# if we synced, restore to cache so that we can continue to accumulate un-synced state
for attr, val in cache.items():
setattr(self, attr, val)

Contributor

What's the use case for this? If we sync, we should assume all metrics are operating off the synced state and not accumulate local changes, right?

Contributor

This was here already, just moved.

Added in dd1e744
cc: @SkafteNicki
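A tiny arithmetic illustration of why the restore matters (the numbers are made up, and the multiply stands in for an all-gather-and-sum across two ranks): without restoring the cache, later updates would accumulate on top of the already world-reduced value and double-count the other ranks' contributions.

```python
world_size = 2
local = 3
synced = local * world_size  # stand-in for all_gather + sum across ranks
# Continue the epoch with one more local update of 4:
wrong = synced + 4   # accumulated on top of the synced (reduced) state
right = local + 4    # accumulated on top of the restored local state
assert wrong == 10 and right == 7  # the un-restored path double-counts
```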

)

for attr, reduction_fn in self._reductions.items():
# pre-processing ops (stack or flatten for inputs)
if isinstance(output_dict[attr][0], Tensor):
if isinstance(output_dict[attr], Sequence) and isinstance(output_dict[attr][0], Tensor):
Contributor

This check is not safe; we're seeing errors as a result:

if isinstance(output_dict[attr], Sequence) and isinstance(output_dict[attr][0], Tensor):

IndexError: list index out of range
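A guarded version of that check would avoid indexing an empty sequence. This is a sketch only: int stands in for torch.Tensor so it runs without torch, and real code would also need to exclude str, which is itself a Sequence.

```python
from collections.abc import Sequence
from typing import Any

def is_nonempty_tensor_seq(val: Any, tensor_type: type = int) -> bool:
    """Like the failing check, but the val[0] access is guarded by a
    length check, so empty lists return False instead of raising."""
    return (isinstance(val, Sequence)
            and len(val) > 0
            and isinstance(val[0], tensor_type))

assert not is_nonempty_tensor_seq([])       # no IndexError on an empty list
assert is_nonempty_tensor_seq([1, 2])
assert not is_nonempty_tensor_seq("abc")    # elements are str, not tensor_type
```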

Contributor

Removed in #311

Development

Successfully merging this pull request may close these issues.

Offer a dedicated sync() interface on the base Metric class
9 participants