
test produces a warning when using DDP #12862

Closed
RuRo opened this issue Apr 23, 2022 · 11 comments
Labels
question Further information is requested strategy: ddp DistributedDataParallel trainer: test won't fix This will not be worked on

Comments

@RuRo
Contributor

RuRo commented Apr 23, 2022

Trying to run trainer.test with multiple GPUs (or even with a single GPU using DDPStrategy) produces the following warning:

PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`,
it is recommended to use `Trainer(devices=1)` to ensure each sample/batch gets evaluated
exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some
samples to make sure all devices have same batch size in case of uneven inputs.

The problem is that the warning doesn't adequately explain how to fix this problem in all possible cases.

1. What if I am running trainer.test after trainer.fit?

Setting devices=1 in that case is not really a solution, because I want to use multiple GPUs for training. Creating a new Trainer instance also doesn't quite work, because that would create a separate experiment (AFAIK?). For example, ckpt_path="best" wouldn't work with a new Trainer instance, the TensorBoard logs would get segmented, and so on.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer? (btw, this might be useful even outside of this issue, as tune currently doesn't work well with DDP)

2. What if I don't care about DistributedSampler adding extra samples?

Please correct me if I am wrong, but DistributedSampler should add at most num_devices - 1 extra samples. This means that unless you are using hundreds of devices or extremely small datasets, the difference in metrics will probably be

a) less than the rounding precision
b) less than the natural fluctuations due to random initialization and non-deterministic CUDA shenanigans
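
For concreteness, here is a quick back-of-the-envelope check of the padding (the dataset size and device count are made-up numbers):

import math

dataset_len = 1001  # hypothetical test set size
num_devices = 8     # hypothetical number of GPUs

# DistributedSampler pads the index list up to a multiple of num_devices,
# so every device sees the same number of samples.
per_device = math.ceil(dataset_len / num_devices)  # 126
total_seen = per_device * num_devices              # 1008
extra = total_seen - dataset_len                   # 7 == num_devices - 1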

I think that bothering users with such a minor issue isn't really desirable. Can this warning be silenced somehow?

3. Can this be fixed without requiring any changes from the users?

I found pytorch_lightning.overrides.distributed.UnrepeatedDistributedSampler, which allegedly solves this exact problem, but doesn't work for training.

Does UnrepeatedDistributedSampler solve this issue? If it does, I think it should at least be mentioned in the warning and, at best, used automatically during test instead of warning the user.

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

@RuRo RuRo added the needs triage Waiting to be triaged by maintainers label Apr 23, 2022
@carmocca carmocca added question Further information is requested strategy: ddp DistributedDataParallel and removed needs triage Waiting to be triaged by maintainers labels Apr 26, 2022
@carmocca
Contributor

What if I am running trainer.test after trainer.fit?

Your observations are correct. This is why we recommend splitting your fitting and testing procedures into separate scripts when DDP is used for fitting.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer?

No

I think that bothering users with such a minor issue isn't really desirable.

This can be impactful for the reproducibility of previous results.

Can this warning be silenced somehow?

Yes! It is mentioned in our docs:

import warnings

warnings.filterwarnings("ignore", ".*Consider increasing the value of the `num_workers` argument*")

# or to ignore all warnings that could be false positives
from pytorch_lightning.utilities.warnings import PossibleUserWarning

warnings.filterwarnings("ignore", category=PossibleUserWarning)

I think it should at least be mentioned in the warning and, at best, used automatically during test instead of warning the user.

I believe we looked into this but are hesitant about enabling it by default - cc @kaushikb11

You should be able to import and use this sampler to try it out at the moment.
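
An untested sketch of what that could look like (assuming replace_sampler_ddp=False so Lightning doesn't inject its own DistributedSampler, that the sampler accepts the same arguments as torch.utils.data.DistributedSampler, and that self.test_dataset is a placeholder attribute):

from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from pytorch_lightning.overrides.distributed import UnrepeatedDistributedSampler

trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp", replace_sampler_ddp=False)

# inside your LightningModule or LightningDataModule
def test_dataloader(self):
    # under DDP the process group is already initialized here,
    # so num_replicas/rank can be left to their defaults
    sampler = UnrepeatedDistributedSampler(self.test_dataset, shuffle=False)
    return DataLoader(self.test_dataset, batch_size=32, sampler=sampler)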

@RuRo
Contributor Author

RuRo commented Apr 26, 2022

Your observations are correct. This is why we recommend splitting your fitting and testing procedures into separate scripts when DDP is used for fitting.

Not sure what you mean by "my observations are correct". My point was that splitting the training into separate fit and test scripts is undesirable, because creating a new trainer starts a new experiment and therefore breaks some features of pytorch-lightning.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer?

No

Should it be possible? As I've already mentioned, there are some cases where this ability would be really useful (you may want to tune, test and/or validate your model on a different device or with a different number of devices, and doing it in separate scripts is inconvenient or sometimes even impossible).

@awaelchli
Contributor

awaelchli commented Apr 26, 2022

The problem is that the warning doesn't adequately explain how to fix this problem in all possible cases.

Yes, you are right, it does not. The reason is that some of these possibilities are non-trivial to handle, both for Lightning and for the user. That is why we say the "dumb" thing of using devices=1, because that's the only situation where we can guarantee that metrics get computed properly.

What if I don't care about DistributedSampler adding extra samples?

In this case you're good! No action needs to be taken. The warning can be safely ignored. Should we say that in the warning itself?

I think that bothering users with such a minor issue isn't really desirable.

We do it because in Lightning we care about correctness and reproducibility. If the test metric varies between runs that use different numbers of devices, it raises questions that are hard to answer UNLESS you stumble over this post here or are deeply familiar with the internals of PyTorch and DDP.

Can this be fixed without requiring any changes from the users?

Right now, no! It's really unfortunate, I know. Ideally this would be solved by Lightning and the user wouldn't have to do anything. For this, Lightning would need to support uneven inputs (with something like the unrepeated sampler you mention) AND support syncing the metric. The syncing of the metric is the tricky part here, specifically syncing with on_step=True, which would lead to a deadlock if processes have different numbers of batches.
For more context, here is where we first discussed uneven inputs and DDP join: #5141
Here is a continued thread with more information: #3325

I believe that if some restrictions were in place (say, syncing on a per-batch level is not allowed and one uses torchmetrics exclusively), then the unrepeated sampler + DDP could work. However, these restrictions might be too strong in general.
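
To illustrate the kind of restricted pattern I mean, a rough sketch (not something Lightning does today; MyModel and _compute_loss are placeholders). The torchmetrics state is accumulated locally on each rank and only synced when the metric is computed at the end of testing, so no per-batch collective is needed:

import torchmetrics
from pytorch_lightning import LightningModule

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # metric state lives on each rank; nothing is synced per step
        self.test_loss = torchmetrics.MeanMetric()

    def test_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)  # placeholder
        self.test_loss.update(loss)
        # on_step=False: no per-batch sync, so a rank that runs out of
        # batches earlier never waits on a collective the others skip
        self.log("test_loss", self.test_loss, on_step=False, on_epoch=True)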

On a personal note, this incorrectness problem has bothered me for a long time and I have wanted to solve it since it emerged, but it is SO HARD!

@RuRo
Contributor Author

RuRo commented Apr 26, 2022

No action needs to be taken. The warning can be safely ignored. Should we say that in the warning itself?

I don't know how representative my opinion is of the wider community, but I take warnings very seriously. I have a zero-warning policy, and I try to understand and fix all the warnings in my code.

To me, a warning is basically an Exception that didn't terminate the program. In some sense, a warning is worse than an exception, because a program that does the wrong thing is worse than a program that doesn't work (imo).

A warning indicates to the user that they are probably doing something wrong. If there is no way for the user to fix the warning, then that warning is incorrect and is therefore a bug (a low-priority bug, but nevertheless still a bug).

I know that you can silence warnings with

warnings.filterwarnings("ignore", ...)

but I consider this approach to be bad practice.

Here are my reasons against manually filtering warnings:

  1. A warning that is a false-positive today may become a true-positive tomorrow (due to changes in another part of your code), and you wouldn't know about it because you silenced it
  2. It's possible to accidentally silence more warnings than you intended to silence
  3. If you match against the warning message and the message changes, the warning will appear again
  4. You have no way to know if/when it is safe to remove the filterwarnings

In my opinion, the "correct" way to fix a warning (as a user) is to change my program in such a way that the warning is not emitted. If there is no way to do so, then it is a bug in the library.


I think that bothering users with such a minor issue isn't really desirable.

We do it because in Lightning we care about correctness and reproducibility. If the test metric varies between runs that use different numbers of devices, it raises questions that are hard to answer UNLESS you stumble over this post here or are deeply familiar with the internals of PyTorch and DDP.

What do you think about only warning the user if deterministic=True? I would imagine that when deterministic=False, the user isn't expecting reproducibility.

@carmocca
Contributor

@RuRo I agree with your thoughts about warnings. This is why all these warnings are under our custom category PossibleUserWarning. Unfortunately, they are often false positives, but we prefer adding some noise and letting people know over not communicating these potential issues at all.

At the moment, we don't have any ideas about how to improve this situation but feel free to mention any that you have.

What do you think about only warning the user if deterministic=True? I would imagine that when deterministic=False, the user isn't expecting reproducibility.

deterministic just controls whether we set https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms, which has a much narrower scope.

@stale

stale bot commented Jun 5, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jun 5, 2022
@stale stale bot closed this as completed Jun 23, 2022
@sounakdey

Just commenting because this is really important for the research community using PyTorch Lightning, to make sure the results are reproducible. Are we still looking into this?

@awaelchli
Contributor

Hi @sounakdey
It is on our roadmap. The issue where this was discussed is here: #3325
If we get support for uneven inputs plus using torchmetrics to compute metrics, trainer.test will be gold and should produce the exact same output regardless of how many GPUs are used to run it.

@amorehead
Contributor

amorehead commented Sep 17, 2022

I second @sounakdey here. For now, I'm working around this issue of not being able to run fit() and test() one after the other with multiple GPUs and a single GPU, respectively. Nonetheless, it would be very desirable to have this functionality in place so that, e.g., loggers like WandB can report training, validation, and testing results in the same experiment (i.e., run).

@Hannibal046

Hi, any progress on this? I am currently a newbie with Lightning. I am working on a language generation task. The metric I use is BLEU, which is a corpus-level metric:

# hyps: a list of strings
# refs: a list of strings
score = get_bleu_score(hyps, refs)

So in a DDP test setting, I have to do the following:

  1. unevenly and sequentially distribute the test data to multiple GPUs to generate translations

  2. gather the generated translations from all GPUs on rank 0

  3. calculate the BLEU score on rank 0

To perform 2 and 3, I think this will do (not sure if self.all_gather can gather non-tensor objects, but torch.distributed.all_gather_object should):

def test_step(self, batch, batch_idx):
    src, ref = batch
    hyp = self.generate(src)  # generated translation for this batch
    return hyp, ref


def test_epoch_end(self, outputs):
    # note: self.all_gather works on tensors; lists of strings need
    # torch.distributed.all_gather_object (see the sketch below)
    all_hyps, all_refs = self.all_gather(outputs)
    if self.trainer.is_global_zero:
        score = get_bleu_score(all_hyps, all_refs)
        self.log("BLEU", score, rank_zero_only=True)

However, when I want to unevenly and sequentially distribute the test data to multiple GPUs, I don't know what to do. Here are some naive solutions I came up with:

  1. Define an UnevenSequentialDistributedSampler like the one below for the test dataloader. It needs num_replicas and rank, and I don't know how to get them in the LightningDataModule. Also, this requires me to define a sampler for train_dataloader, and making it compatible with single-GPU training is a problem. (One possible way to wire this up is sketched after this list.)
import math

import torch.distributed
import torch.utils.data


class UnevenSequentialDistributedSampler(torch.utils.data.sampler.Sampler):
    """
    This is a slightly different version of SequentialDistributedSampler from
    https://github.com/huggingface/transformers/blob/81ac45f85c35244831f11f73c09ea10eee4f953a/src/transformers/trainer_pt_utils.py
    In this version, the dataset is not evenly split, since we don't need tensors of the same shape to reduce or gather.
    """

    def __init__(self, dataset, num_replicas=None, rank=None):
        if num_replicas is None:
            if not torch.distributed.is_available():
                raise RuntimeError("Requires distributed package to be available")
            num_replicas = torch.distributed.get_world_size()
        if rank is None:
            if not torch.distributed.is_available():
                raise RuntimeError("Requires distributed package to be available")
            rank = torch.distributed.get_rank()
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank
        self.num_samples = int(math.ceil(len(self.dataset) / self.num_replicas))
        self.total_size = self.num_samples * self.num_replicas
        indices = list(range(len(self.dataset)))
        # slicing past the end of a Python list simply truncates,
        # so the last rank may get fewer samples than the others
        self.indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)
  2. Get the number of val and test samples in the LightningModule, and manually truncate in test_epoch_end:
def test_epoch_end(self, outputs):
    all_hyps, all_refs = self.all_gather(outputs)

    all_hyps = all_hyps[:self.number_of_test_samples]
    all_refs = all_refs[:self.number_of_test_samples]

    if self.trainer.is_global_zero:
        score = get_bleu_score(all_hyps, all_refs)
        self.log("BLEU", score, rank_zero_only=True)

I don't know what the best practice is here; could you please help me out?

@mfoglio

mfoglio commented May 1, 2024

Hello, is there any progress on this? Thank you!
