
Add docstrings and the row_groups_per_part parameter to the MerlinDataLoader class #590

Merged: 4 commits merged into main on Jan 10, 2023

Conversation

@sararb (Contributor) commented Dec 29, 2022

Fixes #550

@bbozkaya ran several tests (see image below) repartitioning a parquet file (with pandas or cudf), and it appears that MerlinDataLoader always loads the dataset files as a single partition, even though we partition the data into multiple row groups when saving the parquet file (as recommended here). To take these partitions into account, we should pass row_groups_per_part=True to merlin.io.Dataset.
[image: repartitioning test results]
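To make the intended behavior concrete, here is a simplified, illustrative model of how a `row_groups_per_part` setting could map a parquet file's row groups to dataset partitions. This is a hedged sketch of the semantics described above, not Merlin's actual implementation; the helper name `num_partitions` is hypothetical.

```python
import math


def num_partitions(num_row_groups: int, row_groups_per_part) -> int:
    """Illustrative mapping from parquet row groups to dataset partitions.

    - True  -> one partition per row group
    - int n -> ceil(num_row_groups / n) partitions (n row groups per partition)
    - False -> a single partition (the behavior observed before this fix)
    """
    if row_groups_per_part is True:
        return num_row_groups
    # Note: bool is a subclass of int in Python, so check `> 0` to let
    # False fall through to the single-partition case.
    if isinstance(row_groups_per_part, int) and row_groups_per_part > 0:
        return math.ceil(num_row_groups / row_groups_per_part)
    return 1
```

For example, a file saved with 8 row groups would load as 8 partitions with `row_groups_per_part=True`, but as 1 partition when the flag is not passed, which matches the behavior reported in the tests above.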

Goals ⚽

  • Add the row_groups_per_part parameter to MerlinDataLoader so that the dataset is loaded with the correct number of partitions.
  • Add docstrings to MerlinDataLoader explaining the different parameters.
  • Add a user warning checking that the dataset's partition count is divisible by the number of GPUs for DDP training. This is needed for optimal performance, since it lets the data be distributed equally among the available GPUs.
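The third goal, the divisibility warning, can be sketched as a small standalone check. This is an assumed implementation (the function name `check_partition_distribution` is hypothetical), showing only the warning logic described above:

```python
import warnings


def check_partition_distribution(npartitions: int, world_size: int) -> bool:
    """Warn if dataset partitions cannot be split evenly across GPUs (DDP).

    Returns True when the distribution is even (or single-GPU), False otherwise.
    """
    if world_size > 1 and npartitions % world_size != 0:
        warnings.warn(
            f"The number of dataset partitions ({npartitions}) is not divisible "
            f"by the number of GPUs ({world_size}); data will be distributed "
            "unevenly across workers, which can hurt DDP training performance."
        )
        return False
    return True
```

A dataset with 8 partitions on 2 GPUs passes silently, while 7 partitions on 2 GPUs triggers the warning.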

@sararb sararb added enhancement New feature or request Multi-GPU labels Dec 29, 2022
@sararb sararb added this to the Merlin 23.01 milestone Dec 29, 2022
@sararb sararb self-assigned this Dec 29, 2022
@bbozkaya (Contributor) left a comment


I tested on 2 GPUs. It works fine when loading partitioned train data. For validation, however, it seems to use only 1 GPU and 1 partition. Is this expected, or does it also need to be addressed?

@sararb (Contributor, Author) commented Jan 10, 2023

> I tested on 2 GPUs. It works fine when loading partitioned train data. For validation, however, it seems to use only 1 GPU and 1 partition. Is this expected, or does it also need to be addressed?

Thank you for testing the solution! For the validation step, we rely on how HF transformers sets up DDP training, and it seems the Trainer does not wrap the model in DDP mode for evaluation (training=False) (here). So it is expected that validation runs on a single GPU, though I don't know the motivation behind it. I posted a question on the HF forum to better understand the Trainer's behavior in DDP + evaluation mode.

@bbozkaya bbozkaya merged commit 95d0a6e into main Jan 10, 2023
Development

Successfully merging this pull request may close these issues.

[BUG] fix partition script in the DistributedDataParallel documentation
3 participants