Adding an FAQ on recommended range of dataset size for cudf.pandas #16869

singhmanas1 · 2024-09-23T00:26:02Z

Description

Currently, it is hard to know what range of dataset sizes users can process with cudf.pandas while still seeing performance gains over CPU pandas. With managed memory enabled by default, datasets can be much larger than GPU memory, up to a theoretical limit of CPU + GPU memory. However, performance can degrade well below that theoretical limit, depending on the workflow. This PR adds an FAQ that helps clarify this complexity with specific examples from the one billion row challenge. Note: the exploration that led to this PR originated from internal VDR feedback, quoted below:

To encourage best usage of cuDF Pandas, we recommend communicating target dataset sizes in the documentation. Developers using too small a dataset will see slower performance than CPU only, and too large a dataset will yield no performance increase due to GPU memory constraints.
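As a rough illustration of the memory regimes described above, the sketch below classifies a dataset by where it would have to live. The helper name `memory_regime` and the capacities in the usage lines are hypothetical and not part of cudf.pandas; the thresholds encode only the theoretical limits, not measured performance.

```python
def memory_regime(dataset_gb: float, gpu_gb: float, cpu_gb: float) -> str:
    """Classify a dataset size against the theoretical memory limits.

    Hypothetical helper for illustration only; not a cudf.pandas API.
    """
    if dataset_gb < 0 or gpu_gb <= 0 or cpu_gb < 0:
        raise ValueError("sizes must be non-negative and gpu_gb must be positive")
    if dataset_gb <= gpu_gb:
        # The data itself fits in GPU memory.
        return "fits-in-gpu"
    if dataset_gb <= gpu_gb + cpu_gb:
        # Only possible with managed memory spilling into CPU memory.
        return "oversubscribed"
    # Beyond the theoretical CPU + GPU limit.
    return "exceeds-limit"


# Assumed example machine: 16 GB of GPU memory, 64 GB of CPU memory.
print(memory_regime(10, gpu_gb=16, cpu_gb=64))
print(memory_regime(40, gpu_gb=16, cpu_gb=64))
print(memory_regime(100, gpu_gb=16, cpu_gb=64))
```

This only compares the dataset size itself against capacity. Intermediate results also need memory, which is one reason slowdowns can appear well before the "oversubscribed" boundary, depending on the workflow.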

Related to the question added in the PR "Add performance tips to cudf.pandas FAQ". We could also merge this PR with the question added in PR #16693.

Note: This PR is based on very early exploration and needs further discussion on the following:

Would any recommendation on an upper limit of dataset size for cudf.pandas be self-limiting and inaccurate?
Do we need additional experiments to give a directional idea of the upper limit? For example, running the one billion row challenge to determine how large the dataset can get before we see significant performance degradation. One potential experiment: extend the one billion row challenge results HERE to the point where cudf.pandas 24.08 and pandas performance are on par. This crossover point could illustrate that you cannot always go up to the limit of CPU + GPU memory when using cudf.pandas.
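The crossover experiment proposed above could be analyzed with something like the following sketch. The function name `find_crossover` is hypothetical, and the timings in the usage lines are made-up placeholders, not benchmark results; real numbers would come from running the same workload with pandas and with cudf.pandas at each dataset size.

```python
from typing import Optional, Sequence


def find_crossover(
    sizes: Sequence[float],
    pandas_secs: Sequence[float],
    cudf_pandas_secs: Sequence[float],
) -> Optional[float]:
    """Return the first size at which cudf.pandas stops being faster.

    Sizes are scanned in ascending order. Sizes where cudf.pandas has
    not yet been faster (for example, very small datasets) are skipped,
    so only the upper crossover is reported. Returns None if cudf.pandas
    stays faster through the largest measured size, or was never faster.
    """
    if not (len(sizes) == len(pandas_secs) == len(cudf_pandas_secs)):
        raise ValueError("all inputs must have the same length")
    was_faster = False
    for size, cpu_t, gpu_t in sorted(zip(sizes, pandas_secs, cudf_pandas_secs)):
        if gpu_t < cpu_t:
            was_faster = True
        elif was_faster:
            return size
    return None


# Placeholder timings in seconds, invented purely to exercise the function.
sizes_gb = [1, 10, 100, 1000]
pandas_t = [0.1, 1.0, 10.0, 100.0]
cudf_pandas_t = [0.3, 0.2, 2.0, 120.0]
print(find_crossover(sizes_gb, pandas_t, cudf_pandas_t))
```

Skipping the initial slow region keeps the small-dataset overhead mentioned in the VDR feedback from being reported as the upper limit.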

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-09-23T00:26:06Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

galipremsagar · 2024-09-25T17:53:27Z

/okay to test

bdice · 2024-09-25T20:11:01Z

Closing as duplicate of #16693 for now.

Manas Singh added 3 commits September 22, 2024 16:54

added an faq on recommended range of dataset size

5a3e609

added an faq for recommended range of dataset size

e8dfdf0

adding placeholder for experimental details

188c0b9

singhmanas1 closed this Sep 23, 2024

singhmanas1 added 2 commits September 22, 2024 20:09

Modified recommended dataset sizes in the faq

a335016

Added link to 'how-it-works' page

de1ce2a

singhmanas1 reopened this Sep 23, 2024

Merge branch 'branch-24.10' into feat/faq_cudf_pandas

900c4cd

galipremsagar approved these changes Sep 25, 2024

Merge branch 'branch-24.10' into feat/faq_cudf_pandas

4122493

galipremsagar added the doc, improvement, non-breaking, and 5 - Ready to Merge labels and removed the improvement label Sep 25, 2024

galipremsagar changed the title ~~Adding an FAQ on recommended range of dataset size for cudf.pandas~~ Adding an FAQ on recommended range of dataset size for cudf.panda Sep 25, 2024

galipremsagar changed the title ~~Adding an FAQ on recommended range of dataset size for cudf.panda~~ Adding an FAQ on recommended range of dataset size for cudf.pandas Sep 25, 2024

galipremsagar removed the 5 - Ready to Merge label Sep 25, 2024

bdice closed this Sep 25, 2024
