DDP-related improvements to data module and logging #594

Merged: 5 commits into ashleve:main on Sep 18, 2023

Conversation

tesfaldet
Contributor

What does this PR do?

  1. Changed the pylogger to be rank-aware: logged messages are now prefixed with the rank of the process that emitted them. It can also log to a specific rank on request, covering the previous pylogger's rank-zero-only use case while providing greater logging flexibility. By default, the new pylogger logs on all ranks, which makes it clear which process a given log line came from (e.g., when instantiating models, it's useful to see that this happens on every process). A sketch of the idea is shown after this list.
  2. The MNIST DataModule's batch size is now divided by the number of processes in a DDP setup, so training dynamics stay more consistent/comparable across runs that use different numbers of devices (also sketched after this list).
  3. The log file written by the Hydra colorlog plugin is now shared across devices when training in a DDP setup. All processes log to the same file, and because the new pylogger provides rank information, it is easy to tell which process a log line came from. Previously, one file called train.log was created for the rank 0 process and train_ddp_process_{rank}.log for every other rank, which made logs confusing to read in a DDP setup.
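A minimal sketch of the rank-aware logger from item 1, assuming Lightning's rank_zero_only utility is used to read the current rank (the class name RankedLogger and the rank keyword are illustrative, not necessarily the exact API in this PR):

```python
import logging
from typing import Optional

from lightning.pytorch.utilities import rank_zero_only


class RankedLogger(logging.LoggerAdapter):
    """Logger adapter that prefixes every message with the emitting process's rank."""

    def __init__(self, name: str = __name__) -> None:
        super().__init__(logging.getLogger(name), extra={})

    def log(self, level: int, msg: str, *args, rank: Optional[int] = None, **kwargs) -> None:
        # Log on all ranks by default; if `rank` is given, log only on that rank.
        if not self.isEnabledFor(level):
            return
        current_rank = getattr(rank_zero_only, "rank", 0)
        if rank is None or rank == current_rank:
            self.logger.log(level, f"[rank: {current_rank}] {msg}", *args, **kwargs)


log = RankedLogger(__name__)
log.info("Instantiating model...")              # emitted by every process, rank-prefixed
log.info("Saving final checkpoint...", rank=0)  # emitted only by rank 0
```

And the batch-size scaling from item 2 could look roughly like this inside the datamodule (a sketch; batch_size_per_device is an illustrative attribute name, and the dataloaders are assumed to read it):

```python
from typing import Optional

from lightning.pytorch import LightningDataModule


class MNISTDataModule(LightningDataModule):
    def __init__(self, batch_size: int = 64) -> None:
        super().__init__()
        self.save_hyperparameters()
        self.batch_size_per_device = batch_size

    def setup(self, stage: Optional[str] = None) -> None:
        # Divide the configured batch size across DDP processes so the effective
        # (global) batch size stays comparable across different device counts.
        if self.trainer is not None:
            if self.hparams.batch_size % self.trainer.world_size != 0:
                raise RuntimeError(
                    f"Batch size ({self.hparams.batch_size}) is not divisible by "
                    f"the number of devices ({self.trainer.world_size})."
                )
            self.batch_size_per_device = self.hparams.batch_size // self.trainer.world_size
```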

Before submitting

  • Did you make sure the title is self-explanatory and the description concisely explains the PR?
  • [~] Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with pytest command?
  • Did you run pre-commit hooks with pre-commit run -a command?

Did you have fun?

y

Make sure you had fun coding 🙃

@codecov-commenter

codecov-commenter commented Aug 17, 2023

Codecov Report

Patch coverage: 80.00% and project coverage change: -0.51% ⚠️

Comparison is base (8055898) 83.75% compared to head (462909d) 83.24%.
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #594      +/-   ##
==========================================
- Coverage   83.75%   83.24%   -0.51%     
==========================================
  Files          11       11              
  Lines         357      376      +19     
==========================================
+ Hits          299      313      +14     
- Misses         58       63       +5     
| Files Changed | Coverage | Δ |
|---|---|---|
| src/eval.py | 87.50% <66.66%> | (ø) |
| src/utils/pylogger.py | 77.27% <76.19%> | (-22.73%) ⬇️ |
| src/data/mnist_datamodule.py | 93.02% <80.00%> | (-1.72%) ⬇️ |
| src/train.py | 96.00% <83.33%> | (+3.54%) ⬆️ |
| src/utils/__init__.py | 100.00% <100.00%> | (ø) |
| src/utils/instantiators.py | 80.64% <100.00%> | (ø) |
| src/utils/logging_utils.py | 25.00% <100.00%> | (ø) |
| src/utils/rich_utils.py | 82.60% <100.00%> | (ø) |
| src/utils/utils.py | 72.09% <100.00%> | (ø) |

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

@tesfaldet
Contributor Author

Seems like the checks are using Python 3.8. Is there a way to make them use Python 3.10?

@tesfaldet
Contributor Author

Also, I just realized I might need to take this issue into account: Lightning-AI/pytorch-lightning#12862. It affects the train script because trainer.test is called right after trainer.fit finishes. If the trainer was initialized with the DDP strategy, then when testing starts it will emit a warning:

lightning/pytorch/trainer/connectors/data_connector.py:226: PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`, it is recommended to use `Trainer(devices=1, num_nodes=1)` to ensure each sample/batch gets evaluated exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some samples to make sure all devices have same batch size in case of uneven inputs.

I think this warning is quite important to deal with, since we wouldn't want inaccurate test metrics to be reported because the same trainer that was initialized with a DDP strategy is accidentally re-used. A possible solution is to initialize a new trainer separately for testing. The only issue I can think of with this approach is that it will create a new Logger for testing, meaning you won't have all your train, validation, and test results neatly presented in a single log (e.g., a single TensorBoard log, a single WandB log, or a single Aim log). Lightning doesn't save Logger objects in checkpoints, so they can't be restored from checkpoints (I'm aware of your PR for this, but it seems to be specific to the TensorBoard Logger).
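A rough sketch of that possible solution in the train script, assuming the usual Hydra cfg, model, datamodule, and trainer objects are in scope; re-using trainer.loggers to keep results in a single log is just an assumption here, not something Lightning guarantees to work cleanly:

```python
from lightning.pytorch import Trainer

# Hypothetical sketch: after DDP training, build a fresh single-device trainer for
# testing so that DistributedSampler cannot replicate samples during trainer.test().
if cfg.get("test"):
    ckpt_path = trainer.checkpoint_callback.best_model_path  # assumes a ModelCheckpoint callback was configured
    test_trainer = Trainer(accelerator="auto", devices=1, num_nodes=1, logger=trainer.loggers)
    test_trainer.test(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
```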

@ashleve (Owner) left a comment

  1. I like this new logger. What's the default way for user to limit the logging to only master rank? Can you add something like optional log_master_only arg to get_ranked_pylogger(...)?
  2. Sure, we can have this
  3. Good idea. Are you sure logging to the same file from many processes doesn't need synchronisation? Won't there be any conflicts leading to logs getting lost sometimes?

@ashleve
Owner

ashleve commented Aug 24, 2023

[image attachment]

@tesfaldet
Contributor Author

Getting to it now! Sorry for the holdup.

@tesfaldet
Contributor Author

> 1. I like this new logger. What's the default way for user to limit the logging to only master rank? Can you add something like optional log_master_only arg to get_ranked_pylogger(...)?
> 2. Sure, we can have this
> 3. Good idea. Are you sure logging to the same file from many processes doesn't need synchronisation? Won't there be any conflicts leading to logs getting lost sometimes?
  1. Thanks! I can certainly provide an optional argument to get_ranked_pylogger(...) to restrict logging to rank zero, so that the user doesn't always have to keep passing rank=0 to the log function. How does rank_zero_only sound? That should keep things in line with naming conventions across Lightning, so the user immediately knows what the optional arg means.

  2. 👍

  3. So far, in my hundreds of DDP experiments, I haven't seen any issues with writes to the same log file from multiple processes, even from hundreds of processes at a time. However, this is just anecdotal. Let me see if I can find code showing that there's some synchronisation or atomic file writing going on under the hood. I'm guessing I'll have to look at logging.FileHandler, since that's what Hydra appears to be using. I'll report back.

@ashleve
Owner

ashleve commented Sep 13, 2023

yea rank_zero_only is better

@tesfaldet
Contributor Author

Added rank_zero_only :) I also changed the implementation, since I was having issues with my previous one.
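For illustration, usage with the new flag might look something like this (a sketch extending the earlier RankedLogger idea with a rank_zero_only constructor flag; the exact signature is whatever landed in src/utils/pylogger.py):

```python
# Sketch only: names follow the RankedLogger idea sketched earlier in this thread.
log_all = RankedLogger(__name__)                         # logs on every rank, rank-prefixed
log_rank0 = RankedLogger(__name__, rank_zero_only=True)  # logs only on the rank 0 process

log_all.info("Instantiating datamodule...")  # appears once per process
log_rank0.info("Best ckpt path: ...")        # appears once, from rank 0
```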

@tesfaldet tesfaldet requested a review from ashleve September 18, 2023 13:58
@ashleve (Owner) left a comment

LGTM

@ashleve ashleve merged commit 1fb5405 into ashleve:main Sep 18, 2023
11 checks passed
@tesfaldet
Contributor Author

I forgot to report back about the multiprocessing logging concern. In short, in the context of distributed logging where it's OK to have interleaved logs from multiple processes (i.e., you're not expecting a guaranteed ordering of logs across processes), all seems to be fine. See this Stack Overflow post for more info: https://stackoverflow.com/questions/12238848/python-multiprocessinglogging-filehandler. Furthermore, there were some discussions about Hydra configuring its logging setup to support logging in a distributed setting, but they didn't result in any concrete change (facebookresearch/hydra#1148). From what I and others have experienced, nothing bad has happened to our logs when logging from multiple processes to a single file.

That being said, it might be worth looking into the officially-recommended way of logging to a single file from multiple processes: https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
(that shouldn't be the responsibility of this template though...)
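For reference, the cookbook pattern boils down to funnelling records from every process through a queue to a single writer, roughly like this (a standard-library sketch, not something this PR adds):

```python
import logging
import logging.handlers
import multiprocessing


def configure_worker_logging(queue) -> None:
    """In each worker process: send log records to the queue instead of writing files directly."""
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.INFO)


def start_listener(queue, log_file: str) -> logging.handlers.QueueListener:
    """In the main process: a single listener drains the queue and writes to one file."""
    file_handler = logging.FileHandler(log_file)
    file_handler.setFormatter(logging.Formatter("[%(processName)s][%(levelname)s] %(message)s"))
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()  # call listener.stop() once all workers have finished
    return listener
```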
