
Add docstrings and the row_groups_per_part parameter to the MerlinDataLoader class #590

Merged: 4 commits merged into main on Jan 10, 2023

Conversation

@sararb (Contributor) commented Dec 29, 2022

Fixes #550

@bbozkaya ran several tests (see image below) repartitioning a parquet file (with pandas or cudf), and it appears that MerlinDataLoader always loads the dataset files as a single partition, even though we partition the data into multiple row groups when saving the parquet file (as recommended here). To take these partitions into account, we should pass row_groups_per_part=True to merlin.io.Dataset.
[image: repartitioning test results]
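To make the intended behavior concrete, here is a simplified, illustrative model of how a `row_groups_per_part` setting could map a parquet file's row groups to dataset partitions. This is a hedged sketch of the semantics described above, not Merlin's actual implementation; the helper name `num_partitions` is hypothetical.

```python
import math


def num_partitions(num_row_groups: int, row_groups_per_part) -> int:
    """Illustrative mapping from parquet row groups to dataset partitions.

    - True  -> one partition per row group
    - int n -> ceil(num_row_groups / n) partitions (n row groups per partition)
    - False -> a single partition (the behavior observed before this fix)
    """
    if row_groups_per_part is True:
        return num_row_groups
    # Note: bool is a subclass of int in Python, so check `> 0` to let
    # False fall through to the single-partition case.
    if isinstance(row_groups_per_part, int) and row_groups_per_part > 0:
        return math.ceil(num_row_groups / row_groups_per_part)
    return 1
```

For example, a file saved with 8 row groups would load as 8 partitions with `row_groups_per_part=True`, but as 1 partition when the flag is not passed, which matches the behavior reported in the tests above.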

Goals ⚽

  • Add the row_groups_per_part parameter to MerlinDataLoader so that the dataset is loaded with the correct number of partitions.
  • Add docstrings to MerlinDataLoader explaining the different parameters.
  • Add a user warning checking that the dataset's partition count is divisible by the number of GPUs for DDP training. This is needed for optimal performance, since it lets the data be distributed equally among the available GPUs.
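The third goal, the divisibility warning, can be sketched as a small standalone check. This is an assumed implementation (the function name `check_partition_distribution` is hypothetical), showing only the warning logic described above:

```python
import warnings


def check_partition_distribution(npartitions: int, world_size: int) -> bool:
    """Warn if dataset partitions cannot be split evenly across GPUs (DDP).

    Returns True when the distribution is even (or single-GPU), False otherwise.
    """
    if world_size > 1 and npartitions % world_size != 0:
        warnings.warn(
            f"The number of dataset partitions ({npartitions}) is not divisible "
            f"by the number of GPUs ({world_size}); data will be distributed "
            "unevenly across workers, which can hurt DDP training performance."
        )
        return False
    return True
```

A dataset with 8 partitions on 2 GPUs passes silently, while 7 partitions on 2 GPUs triggers the warning.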

@sararb sararb added enhancement New feature or request Multi-GPU labels Dec 29, 2022
@sararb sararb added this to the Merlin 23.01 milestone Dec 29, 2022
@sararb sararb self-assigned this Dec 29, 2022
@bbozkaya (Contributor) left a comment


I tested on 2 GPUs. It works fine when loading partitioned train data. For validation, however, it seems to use only 1 GPU and 1 partition. Is this expected, or does it also need to be addressed?

@sararb (Contributor, Author) commented Jan 10, 2023

> I tested on 2 GPUs. It works fine when loading partitioned train data. For validation, however, it seems to use only 1 GPU and 1 partition. Is this expected, or does it also need to be addressed?

Thank you for testing the solution! For the validation step, we rely on how HF transformers sets up DDP training, and it seems the Trainer does not wrap the model in DDP mode for evaluation (training=False) (here). So it is expected that validation runs on a single GPU, though I don't know the motivation behind it. I posted a question on the HF forum to better understand the Trainer's behavior in DDP + evaluation mode.

@bbozkaya bbozkaya merged commit 95d0a6e into main Jan 10, 2023
Development

Successfully merging this pull request may close these issues.

[BUG] fix partition script in the DistributedDataParallel documentation
3 participants