
test produces a warning when using DDP #12862

Closed
RuRo opened this issue Apr 23, 2022 · 11 comments
Labels
question Further information is requested strategy: ddp DistributedDataParallel trainer: test won't fix This will not be worked on

Comments

@RuRo
Contributor

RuRo commented Apr 23, 2022

Trying to run trainer.test with multiple GPUs (or even with a single GPU using DDPStrategy) produces the following warning:

PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`,
it is recommended to use `Trainer(devices=1)` to ensure each sample/batch gets evaluated
exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some
samples to make sure all devices have same batch size in case of uneven inputs.

The problem is that the warning doesn't adequately explain how to fix this problem in all possible cases.

1. What if I am running trainer.test after trainer.fit?

Setting devices=1 in that case is not really a solution, because I want to use multiple GPUs for training. Creating a new Trainer instance also doesn't quite work, because that would create a separate experiment (AFAIK?). For example, ckpt_path="best" wouldn't work with a new Trainer instance, the TensorBoard logs would get segmented, and so on.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer? (btw, this might be useful even outside of this issue, as tune currently doesn't work well with DDP)

2. What if I don't care about DistributedSampler adding extra samples?

Please correct me if I am wrong, but DistributedSampler should add at most num_devices - 1 extra samples. This means that unless you are using hundreds of devices or extremely small datasets, the difference in metrics will probably be

a) less than the rounding precision
b) less than the natural fluctuations due to random initialization and non-deterministic CUDA shenanigans
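
For concreteness, here is a quick back-of-the-envelope check of the padding (the dataset size and device count are made-up numbers):

import math

dataset_len = 1001  # hypothetical test set size
num_devices = 8     # hypothetical number of GPUs

# DistributedSampler pads the index list up to a multiple of num_devices,
# so every device sees the same number of samples.
per_device = math.ceil(dataset_len / num_devices)  # 126
total_seen = per_device * num_devices              # 1008
extra = total_seen - dataset_len                   # 7 == num_devices - 1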

I think that bothering users with such a minor issue isn't really desirable. Can this warning be silenced somehow?

3. Can this be fixed without requiring any changes from the users?

I found pytorch_lightning.overrides.distributed.UnrepeatedDistributedSampler, which allegedly solves this exact problem, but doesn't work for training.

Does UnrepeatedDistributedSampler solve this issue? If it does, I think it should at least be mentioned in the warning and, at best, used automatically during test instead of warning the user.

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

@RuRo RuRo added the needs triage Waiting to be triaged by maintainers label Apr 23, 2022
@carmocca carmocca added question Further information is requested strategy: ddp DistributedDataParallel and removed needs triage Waiting to be triaged by maintainers labels Apr 26, 2022
@carmocca
Contributor

What if I am running trainer.test after trainer.fit?

Your observations are correct. This is why we recommend splitting your fitting and testing procedures into separate scripts when DDP is used for fitting.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer?

No

I think that bothering users with such a minor issue isn't really desirable.

This can be impactful for the reproducibility of previous results.

Can this warning be silenced somehow?

Yes! It is mentioned in our docs:

import warnings

warnings.filterwarnings("ignore", ".*Consider increasing the value of the `num_workers` argument*")

# or to ignore all warnings that could be false positives
from pytorch_lightning.utilities.warnings import PossibleUserWarning

warnings.filterwarnings("ignore", category=PossibleUserWarning)

I think it should at least be mentioned in the warning and, at best, used automatically during test instead of warning the user.

I believe we looked into this but are hesitant about enabling it by default - cc @kaushikb11

You should be able to import and use this sampler to try it out at the moment.
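
An untested sketch of what that could look like (assuming replace_sampler_ddp=False so Lightning doesn't inject its own DistributedSampler, that the sampler accepts the same arguments as torch.utils.data.DistributedSampler, and that self.test_dataset is a placeholder attribute):

from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from pytorch_lightning.overrides.distributed import UnrepeatedDistributedSampler

trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp", replace_sampler_ddp=False)

# inside your LightningModule or LightningDataModule
def test_dataloader(self):
    # under DDP the process group is already initialized here,
    # so num_replicas/rank can be left to their defaults
    sampler = UnrepeatedDistributedSampler(self.test_dataset, shuffle=False)
    return DataLoader(self.test_dataset, batch_size=32, sampler=sampler)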

@RuRo
Contributor Author

RuRo commented Apr 26, 2022

Your observations are correct. This is why we recommend splitting your fitting and testing procedures into separate scripts when DDP is used for fitting.

Not sure what you mean by "my observations are correct". My point was that splitting the training into separate fit and test scripts is undesirable, because creating a new trainer starts a new experiment and therefore breaks some features of pytorch-lightning.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer?

No

Should it be possible? As I've already mentioned, there are some cases where this ability would be really useful (you may want to tune, test and/or validate your model on a different device or with a different number of devices, and doing it in separate scripts is inconvenient or sometimes even impossible).

@awaelchli
Contributor

awaelchli commented Apr 26, 2022

The problem is that the warning doesn't adequately explain how to fix this problem in all possible cases.

Yes, you are right, it does not. The reason is that some of these possibilities are non-trivial to handle, both for Lightning and for the user. That is why we say the "dumb" thing of using devices=1, because that's the only situation where we can guarantee that metrics get computed properly.

What if I don't care about DistributedSampler adding extra samples?

In this case you're good! No action needs to be taken. The warning can be safely ignored. Should we say that in the warning itself?

I think that bothering users with such a minor issue isn't really desirable.

We do it because in Lightning we care about correctness and reproducibility. If the test metric varies between runs that use different numbers of devices, it raises questions that are hard to answer UNLESS you stumble over this post here or are deeply familiar with the internals of PyTorch and DDP.

Can this be fixed without requiring any changes from the users?

Right now, no! It's really unfortunate, I know. Ideally this would be solved by Lightning and the user wouldn't have to do anything. For this, Lightning would need to support uneven inputs (with something like the unrepeated sampler you mention) AND support syncing the metric. The syncing of the metric is the tricky part here, specifically syncing with on_step=True, which would lead to a deadlock if processes have different numbers of batches.
For more context, here is where we first discussed uneven inputs and DDP join: #5141
Here is a continued thread with more information: #3325

I believe that if some restrictions were in place (say, syncing on a per-batch level is not allowed and one uses torchmetrics exclusively), then the unrepeated sampler + DDP could work. However, these restrictions might be too strong in general.
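
To illustrate the kind of restricted pattern I mean, a rough sketch (not something Lightning does today; MyModel and _compute_loss are placeholders). The torchmetrics state is accumulated locally on each rank and only synced when the metric is computed at the end of testing, so no per-batch collective is needed:

import torchmetrics
from pytorch_lightning import LightningModule

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # metric state lives on each rank; nothing is synced per step
        self.test_loss = torchmetrics.MeanMetric()

    def test_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)  # placeholder
        self.test_loss.update(loss)
        # on_step=False: no per-batch sync, so a rank that runs out of
        # batches earlier never waits on a collective the others skip
        self.log("test_loss", self.test_loss, on_step=False, on_epoch=True)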

On a personal note, this incorrectness problem has bothered me for a long time and I have wanted to solve it since it emerged, but it is SO HARD!

@RuRo
Contributor Author

RuRo commented Apr 26, 2022

No action needs to be taken. The warning can be safely ignored. Should we say that in the warning itself?

I don't know how representative my opinion is of the wider community, but I take warnings very seriously. I have a zero-warning policy, and I try to understand and fix all the warnings in my code.

To me, a warning is basically an Exception that didn't terminate the program. In some sense, a warning is worse than an exception, because a program that does the wrong thing is worse than a program that doesn't work (imo).

A warning indicates to the user that they are probably doing something wrong. If there is no way for the user to fix the warning, then that warning is incorrect and is therefore a bug (a low-priority bug, but nevertheless still a bug).

I know that you can silence warnings with

warnings.filterwarnings("ignore", ...)

but I consider this approach to be bad practice.

Here are my reasons against manually filtering warnings:

  1. A warning that is a false-positive today may become a true-positive tomorrow (due to changes in another part of your code), and you wouldn't know about it because you silenced it
  2. It's possible to accidentally silence more warnings than you intended to silence
  3. If you match against the warning message and the message changes, the warning will appear again
  4. You have no way to know if/when it is safe to remove the filterwarnings

In my opinion, the "correct" way to fix a warning (as a user) is to change my program in such a way that the warning is not emitted. If there is no way to do so, then it is a bug in the library.


I think that bothering users with such a minor issue isn't really desirable.

We do it because in Lightning we care about correctness and reproducibility. If the test metric varies between runs that use different numbers of devices, it raises questions that are hard to answer UNLESS you stumble over this post here or are deeply familiar with the internals of PyTorch and DDP.

What do you think about only warning the user if deterministic=True? I would imagine that when deterministic=False, the user isn't expecting reproducibility.

@carmocca
Contributor

@RuRo I agree with your thoughts about warnings. This is why all these warnings are under our custom category PossibleUserWarning. Unfortunately, they are often false positives, but we prefer adding some noise and letting people know over not communicating these potential issues at all.

At the moment, we don't have any ideas about how to improve this situation but feel free to mention any that you have.

What do you think about only warning the user if deterministic=True? I would imagine that when deterministic=False, the user isn't expecting reproducibility.

deterministic just controls whether we set https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms, which has a much narrower scope.

@stale

stale bot commented Jun 5, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jun 5, 2022
@stale stale bot closed this as completed Jun 23, 2022
@sounakdey

Just commenting because this is really important for the research community using PyTorch Lightning, to make sure the results are reproducible. Are we still looking into this?

@awaelchli
Contributor

Hi @sounakdey
It is on our roadmap. The issue where this was discussed is here: #3325
If we get support for uneven inputs plus using torchmetrics to compute metrics, trainer.test will be gold and should produce the exact same output regardless of how many GPUs are used to run it.

@amorehead
Contributor

amorehead commented Sep 17, 2022

I second @sounakdey here. For now, I'm working around this issue of not being able to run fit() and test() one after the other with multiple GPUs and a single GPU, respectively. Nonetheless, it would be very desirable to have this functionality in place so that, e.g., loggers like WandB can report training, validation, and testing results in the same experiment (i.e., run).

@Hannibal046

Hi, any progress on this? I am currently a newbie with Lightning. I am working on a language generation task. The metric I use is BLEU, which is a corpus-level metric:

# hyps: a list of strings
# refs: a list of strings
score = get_bleu_score(hyps, refs)

So in a DDP test setting, I have to do the following:

  1. unevenly and sequentially distribute the test data to multiple GPUs to generate translations

  2. gather the generated translations from all GPUs on rank 0

  3. calculate the BLEU score on rank 0

To perform 2 and 3, I think this will do (not sure if self.all_gather can gather non-tensor objects, but torch.distributed.all_gather_object should):

def test_step(self, batch, batch_idx):
    src, ref = batch
    hyp = self.generate(src)  # generated translation for this batch
    return hyp, ref


def test_epoch_end(self, outputs):
    # note: self.all_gather works on tensors; lists of strings need
    # torch.distributed.all_gather_object (see the sketch below)
    all_hyps, all_refs = self.all_gather(outputs)
    if self.trainer.is_global_zero:
        score = get_bleu_score(all_hyps, all_refs)
        self.log("BLEU", score, rank_zero_only=True)

However, when I want to unevenly and sequentially distribute the test data to multiple GPUs, I don't know what to do. Here are some naive solutions I came up with:

  1. Define an UnevenSequentialDistributedSampler like the one below for the test dataloader. It needs num_replicas and rank, and I don't know how to get them in the LightningDataModule. Also, this requires me to define a sampler for train_dataloader, and making it compatible with single-GPU training is a problem. (One possible way to wire this up is sketched after this list.)
import math

import torch.distributed
import torch.utils.data


class UnevenSequentialDistributedSampler(torch.utils.data.sampler.Sampler):
    """
    This is a slightly different version of SequentialDistributedSampler from
    https://github.com/huggingface/transformers/blob/81ac45f85c35244831f11f73c09ea10eee4f953a/src/transformers/trainer_pt_utils.py
    In this version, the dataset is not evenly split, since we don't need tensors of the same shape to reduce or gather.
    """

    def __init__(self, dataset, num_replicas=None, rank=None):
        if num_replicas is None:
            if not torch.distributed.is_available():
                raise RuntimeError("Requires distributed package to be available")
            num_replicas = torch.distributed.get_world_size()
        if rank is None:
            if not torch.distributed.is_available():
                raise RuntimeError("Requires distributed package to be available")
            rank = torch.distributed.get_rank()
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank
        self.num_samples = int(math.ceil(len(self.dataset) / self.num_replicas))
        self.total_size = self.num_samples * self.num_replicas
        indices = list(range(len(self.dataset)))
        # slicing past the end of a Python list simply truncates,
        # so the last rank may get fewer samples than the others
        self.indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)
  2. Get the number of val and test samples in the LightningModule, and manually truncate in test_epoch_end:
def test_epoch_end(self, outputs):
    all_hyps, all_refs = self.all_gather(outputs)

    all_hyps = all_hyps[:self.number_of_test_samples]
    all_refs = all_refs[:self.number_of_test_samples]

    if self.trainer.is_global_zero:
        score = get_bleu_score(all_hyps, all_refs)
        self.log("BLEU", score, rank_zero_only=True)

I don't know what the best practice is here; could you please help me out?

@mfoglio

mfoglio commented May 1, 2024

Hello, is there any progress on this? Thank you!
