DDP-related improvements to data module and logging #594

Merged: 5 commits into ashleve:main on Sep 18, 2023

Conversation

tesfaldet
Contributor

What does this PR do?

  1. Changed the pylogger to be rank-aware: logged messages are now prefixed with the rank of the process that emitted them. It can also log to a specific rank on request, covering the previous pylogger's rank-zero-only use case while providing greater logging flexibility. By default, the new pylogger logs on all ranks, which makes it clear which process a given log line came from (e.g., when instantiating models, it's useful to see that this happens on every process). A sketch of the idea is shown after this list.
  2. The MNIST DataModule's batch size is now divided by the number of processes in a DDP setup, so training dynamics stay more consistent/comparable across runs that use different numbers of devices (also sketched after this list).
  3. The log file written by the Hydra colorlog plugin is now shared across devices when training in a DDP setup. All processes log to the same file, and because the new pylogger provides rank information, it is easy to tell which process a log line came from. Previously, one file called train.log was created for the rank 0 process and train_ddp_process_{rank}.log for every other rank, which made logs confusing to read in a DDP setup.
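A minimal sketch of the rank-aware logger from item 1, assuming Lightning's rank_zero_only utility is used to read the current rank (the class name RankedLogger and the rank keyword are illustrative, not necessarily the exact API in this PR):

```python
import logging
from typing import Optional

from lightning.pytorch.utilities import rank_zero_only


class RankedLogger(logging.LoggerAdapter):
    """Logger adapter that prefixes every message with the emitting process's rank."""

    def __init__(self, name: str = __name__) -> None:
        super().__init__(logging.getLogger(name), extra={})

    def log(self, level: int, msg: str, *args, rank: Optional[int] = None, **kwargs) -> None:
        # Log on all ranks by default; if `rank` is given, log only on that rank.
        if not self.isEnabledFor(level):
            return
        current_rank = getattr(rank_zero_only, "rank", 0)
        if rank is None or rank == current_rank:
            self.logger.log(level, f"[rank: {current_rank}] {msg}", *args, **kwargs)


log = RankedLogger(__name__)
log.info("Instantiating model...")              # emitted by every process, rank-prefixed
log.info("Saving final checkpoint...", rank=0)  # emitted only by rank 0
```

And the batch-size scaling from item 2 could look roughly like this inside the datamodule (a sketch; batch_size_per_device is an illustrative attribute name, and the dataloaders are assumed to read it):

```python
from typing import Optional

from lightning.pytorch import LightningDataModule


class MNISTDataModule(LightningDataModule):
    def __init__(self, batch_size: int = 64) -> None:
        super().__init__()
        self.save_hyperparameters()
        self.batch_size_per_device = batch_size

    def setup(self, stage: Optional[str] = None) -> None:
        # Divide the configured batch size across DDP processes so the effective
        # (global) batch size stays comparable across different device counts.
        if self.trainer is not None:
            if self.hparams.batch_size % self.trainer.world_size != 0:
                raise RuntimeError(
                    f"Batch size ({self.hparams.batch_size}) is not divisible by "
                    f"the number of devices ({self.trainer.world_size})."
                )
            self.batch_size_per_device = self.hparams.batch_size // self.trainer.world_size
```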

Before submitting

  • Did you make sure the title is self-explanatory and the description concisely explains the PR?
  • [~] Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with pytest command?
  • Did you run pre-commit hooks with pre-commit run -a command?

Did you have fun?

y

Make sure you had fun coding 🙃

@codecov-commenter

codecov-commenter commented Aug 17, 2023

Codecov Report

Patch coverage: 80.00% and project coverage change: -0.51% ⚠️

Comparison is base (8055898) 83.75% compared to head (462909d) 83.24%.
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #594      +/-   ##
==========================================
- Coverage   83.75%   83.24%   -0.51%     
==========================================
  Files          11       11              
  Lines         357      376      +19     
==========================================
+ Hits          299      313      +14     
- Misses         58       63       +5     
| Files Changed | Coverage | Δ |
|---|---|---|
| src/eval.py | 87.50% <66.66%> | (ø) |
| src/utils/pylogger.py | 77.27% <76.19%> | (-22.73%) ⬇️ |
| src/data/mnist_datamodule.py | 93.02% <80.00%> | (-1.72%) ⬇️ |
| src/train.py | 96.00% <83.33%> | (+3.54%) ⬆️ |
| src/utils/__init__.py | 100.00% <100.00%> | (ø) |
| src/utils/instantiators.py | 80.64% <100.00%> | (ø) |
| src/utils/logging_utils.py | 25.00% <100.00%> | (ø) |
| src/utils/rich_utils.py | 82.60% <100.00%> | (ø) |
| src/utils/utils.py | 72.09% <100.00%> | (ø) |

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

@tesfaldet
Contributor Author

Seems like the checks are using Python 3.8. Is there a way to make them use Python 3.10?

@tesfaldet
Contributor Author

Also, I just realized I might need to take this issue into account: Lightning-AI/pytorch-lightning#12862. It affects the train script because trainer.test is called right after trainer.fit finishes. If the trainer was initialized with the DDP strategy, then when testing starts it will emit a warning:

lightning/pytorch/trainer/connectors/data_connector.py:226: PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`, it is recommended to use `Trainer(devices=1, num_nodes=1)` to ensure each sample/batch gets evaluated exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some samples to make sure all devices have same batch size in case of uneven inputs.

I think this warning is quite important to deal with, since we wouldn't want inaccurate test metrics to be reported because the same trainer that was initialized with a DDP strategy is accidentally re-used. A possible solution is to initialize a new trainer separately for testing. The only issue I can think of with this approach is that it will create a new Logger for testing, meaning you won't have all your train, validation, and test results neatly presented in a single log (e.g., a single TensorBoard log, a single WandB log, or a single Aim log). Lightning doesn't save Logger objects in checkpoints, so they can't be restored from checkpoints (I'm aware of your PR for this, but it seems to be specific to the TensorBoard Logger).
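A rough sketch of that possible solution in the train script, assuming the usual Hydra cfg, model, datamodule, and trainer objects are in scope; re-using trainer.loggers to keep results in a single log is just an assumption here, not something Lightning guarantees to work cleanly:

```python
from lightning.pytorch import Trainer

# Hypothetical sketch: after DDP training, build a fresh single-device trainer for
# testing so that DistributedSampler cannot replicate samples during trainer.test().
if cfg.get("test"):
    ckpt_path = trainer.checkpoint_callback.best_model_path  # assumes a ModelCheckpoint callback was configured
    test_trainer = Trainer(accelerator="auto", devices=1, num_nodes=1, logger=trainer.loggers)
    test_trainer.test(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
```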

@ashleve (Owner) left a comment

  1. I like this new logger. What's the default way for user to limit the logging to only master rank? Can you add something like optional log_master_only arg to get_ranked_pylogger(...)?
  2. Sure, we can have this
  3. Good idea. Are you sure logging to the same file from many processes doesn't need synchronisation? Won't there be any conflicts leading to logs getting lost sometimes?

@ashleve
Owner

ashleve commented Aug 24, 2023

[image attachment]

@tesfaldet
Contributor Author

Getting to it now! Sorry for the holdup.

@tesfaldet
Contributor Author

> 1. I like this new logger. What's the default way for user to limit the logging to only master rank? Can you add something like optional log_master_only arg to get_ranked_pylogger(...)?
> 2. Sure, we can have this
> 3. Good idea. Are you sure logging to the same file from many processes doesn't need synchronisation? Won't there be any conflicts leading to logs getting lost sometimes?
  1. Thanks! I can certainly provide an optional argument to get_ranked_pylogger(...) to restrict logging to rank zero, so that the user doesn't always have to keep passing rank=0 to the log function. How does rank_zero_only sound? That should keep things in line with naming conventions across Lightning, so the user immediately knows what the optional arg means.

  2. 👍

  3. So far, in my hundreds of DDP experiments, I haven't seen any issues with writes to the same log file from multiple processes, even from hundreds of processes at a time. However, this is just anecdotal. Let me see if I can find code showing that there's some synchronisation or atomic file writing going on under the hood. I'm guessing I'll have to look at logging.FileHandler, since that's what Hydra appears to be using. I'll report back.

@ashleve
Owner

ashleve commented Sep 13, 2023

yea rank_zero_only is better

@tesfaldet
Contributor Author

Added rank_zero_only :) I also changed the implementation, since I was having issues with my previous one.
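For illustration, usage with the new flag might look something like this (a sketch extending the earlier RankedLogger idea with a rank_zero_only constructor flag; the exact signature is whatever landed in src/utils/pylogger.py):

```python
# Sketch only: names follow the RankedLogger idea sketched earlier in this thread.
log_all = RankedLogger(__name__)                         # logs on every rank, rank-prefixed
log_rank0 = RankedLogger(__name__, rank_zero_only=True)  # logs only on the rank 0 process

log_all.info("Instantiating datamodule...")  # appears once per process
log_rank0.info("Best ckpt path: ...")        # appears once, from rank 0
```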

@tesfaldet tesfaldet requested a review from ashleve September 18, 2023 13:58
@ashleve (Owner) left a comment

LGTM

@ashleve ashleve merged commit 1fb5405 into ashleve:main Sep 18, 2023
11 checks passed
@tesfaldet
Contributor Author

I forgot to report back about the multiprocessing logging concern. In short, in the context of distributed logging where it's OK to have interleaved logs from multiple processes (i.e., you're not expecting a guaranteed ordering of logs across processes), all seems to be fine. See this Stack Overflow post for more info: https://stackoverflow.com/questions/12238848/python-multiprocessinglogging-filehandler. Furthermore, there were some discussions about Hydra configuring its logging setup to support logging in a distributed setting, but they didn't result in any concrete change (facebookresearch/hydra#1148). From what I and others have experienced, nothing bad has happened to our logs when logging from multiple processes to a single file.

That being said, it might be worth looking into the officially-recommended way of logging to a single file from multiple processes: https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
(that shouldn't be the responsibility of this template though...)
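For reference, the cookbook pattern boils down to funnelling records from every process through a queue to a single writer, roughly like this (a standard-library sketch, not something this PR adds):

```python
import logging
import logging.handlers
import multiprocessing


def configure_worker_logging(queue) -> None:
    """In each worker process: send log records to the queue instead of writing files directly."""
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.INFO)


def start_listener(queue, log_file: str) -> logging.handlers.QueueListener:
    """In the main process: a single listener drains the queue and writes to one file."""
    file_handler = logging.FileHandler(log_file)
    file_handler.setFormatter(logging.Formatter("[%(processName)s][%(levelname)s] %(message)s"))
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()  # call listener.stop() once all workers have finished
    return listener
```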
