
Adding an FAQ on recommended range of dataset size for cudf.pandas #16869

Closed

Conversation

singhmanas1
Contributor

@singhmanas1 singhmanas1 commented Sep 23, 2024

Description

Currently, it's hard to know the range of dataset sizes that users can process with cudf.pandas while still seeing performance gains over CPU pandas. With managed memory enabled by default, datasets can be much larger than GPU memory, up to the theoretical limit of CPU + GPU memory. However, depending on the workflow, performance degradation can begin well below that theoretical limit. This PR adds an FAQ entry that clarifies this complexity with specific examples from the One Billion Row Challenge. Note: the exploration that led to this PR came from internal VDR feedback, quoted below:

To encourage best usage of cuDF Pandas, we recommend communicating target dataset sizes in the documentation. Developers using too small a dataset will see slower performance than CPU only, and too large a dataset will yield no performance increase due to GPU memory constraints.

This is related to PR #16693 ("Add performance tips to cudf.pandas FAQ"); we could also merge this PR with the question added there.

Note: This PR is based on some very initial exploration and requires further discussion on:

  1. Would any recommendation on an upper limit of dataset size for cudf.pandas be self-limiting and inaccurate?
  2. Do we need additional experiments to give a directional idea of the upper limit? For example, running the One Billion Row Challenge to determine how large a dataset can get before we see significant performance degradation. One potential experiment: extending the One Billion Row Challenge results to the point where cudf.pandas 24.08 and pandas performance are on par. This crossover point could illustrate that you cannot always go up to the limit of CPU + GPU memory when using cudf.pandas.
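The sizing guidance discussed above can be sketched as a simple heuristic. This is a hypothetical illustration, not cuDF API: the function name, the `oversubscribe_factor` knob, and the thresholds are all assumptions; the only facts taken from the discussion are that the theoretical limit is GPU + host memory and that slowdown can begin well below it, depending on the workflow.

```python
def classify_dataset_size(dataset_gb, gpu_mem_gb, host_mem_gb,
                          oversubscribe_factor=0.5):
    """Classify a dataset size relative to available memory (illustrative only).

    oversubscribe_factor is an assumed, workflow-dependent knob: the fraction
    of host memory beyond GPU memory that can typically be used before
    managed-memory paging erodes the GPU speedup.
    """
    theoretical_limit = gpu_mem_gb + host_mem_gb
    comfortable_limit = gpu_mem_gb + oversubscribe_factor * host_mem_gb
    if dataset_gb <= gpu_mem_gb:
        return "fits in GPU memory: expect full acceleration"
    if dataset_gb <= comfortable_limit:
        return "oversubscribed: likely still faster than CPU pandas"
    if dataset_gb <= theoretical_limit:
        return "near theoretical limit: performance may degrade toward CPU parity"
    return "exceeds GPU + host memory: cannot be held in memory"

# Example with assumed hardware: 80 GB of GPU memory, 256 GB of host RAM.
print(classify_dataset_size(60, 80, 256))
print(classify_dataset_size(150, 80, 256))
print(classify_dataset_size(300, 80, 256))
print(classify_dataset_size(400, 80, 256))
```

The crossover experiment proposed above would, in effect, measure where the "oversubscribed" region ends for a given workflow, replacing the assumed factor with empirical data.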

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Sep 23, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@singhmanas1 singhmanas1 reopened this Sep 23, 2024
@galipremsagar galipremsagar added doc Documentation improvement Improvement / enhancement to an existing function non-breaking Non-breaking change 5 - Ready to Merge Testing and reviews complete, ready to merge and removed improvement Improvement / enhancement to an existing function labels Sep 25, 2024
@galipremsagar galipremsagar changed the title Adding an FAQ on recommended range of dataset size for cudf.pandas Adding an FAQ on recommended range of dataset size for cudf.panda Sep 25, 2024
@galipremsagar galipremsagar changed the title Adding an FAQ on recommended range of dataset size for cudf.panda Adding an FAQ on recommended range of dataset size for cudf.pandas Sep 25, 2024
@galipremsagar
Contributor

/okay to test

@galipremsagar galipremsagar removed the 5 - Ready to Merge Testing and reviews complete, ready to merge label Sep 25, 2024
@bdice
Contributor

bdice commented Sep 25, 2024

Closing as duplicate of #16693 for now.

@bdice bdice closed this Sep 25, 2024