Adding an FAQ on recommended range of dataset size for cudf.pandas #16869
Description
Currently, it is hard to know the range of dataset sizes that users can process with cudf.pandas while still seeing performance gains over CPU pandas. With managed memory enabled by default, datasets can be much larger than GPU memory, up to a theoretical limit of CPU + GPU memory. However, performance can degrade well below that theoretical limit, depending on the workflow. This PR adds an FAQ entry that helps clarify this complexity with specific examples from the one billion row challenge. Note: the exploration that led to this PR came from internal VDR feedback, quoted below:
To encourage best usage of cuDF Pandas, we recommend communicating target dataset sizes in the documentation. Developers using too small a dataset will see slower performance than CPU only, and too large a dataset will yield no performance increase due to GPU memory constraints.
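The memory bounds described above can be sketched as a simple check. This is only an illustration: `classify_dataset_size` and the memory figures in the usage lines are hypothetical, it is not part of cudf.pandas, and it ignores the extra working memory that intermediate results need, which is one reason slowdowns can appear well before the theoretical limit.

```python
def classify_dataset_size(dataset_gib, gpu_mem_gib, cpu_mem_gib):
    """Place a dataset size relative to the memory bounds in the description.

    Hypothetical helper for illustration only. It compares raw sizes and
    does not predict actual performance, which depends on the workflow.
    """
    if dataset_gib <= 0 or gpu_mem_gib <= 0 or cpu_mem_gib < 0:
        raise ValueError("sizes must be positive (cpu_mem_gib may be zero)")

    # Theoretical upper bound with managed memory: CPU + GPU memory.
    theoretical_limit_gib = gpu_mem_gib + cpu_mem_gib

    if dataset_gib <= gpu_mem_gib:
        return "fits in GPU memory"
    if dataset_gib <= theoretical_limit_gib:
        # Larger than GPU memory: relies on managed memory, so
        # performance may degrade depending on the workflow.
        return "oversubscribed: relies on managed memory"
    return "exceeds theoretical limit"


# Assumed example machine: 16 GiB of GPU memory, 64 GiB of CPU memory.
for size_gib in (10, 40, 100):
    print(size_gib, "GiB ->", classify_dataset_size(size_gib, 16, 64))
```

The check covers only the upper end of the range. The lower end in the feedback (datasets too small to benefit from the GPU) is not a memory question and would need to be measured per workflow.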
This is related to the question added in the PR "Add performance tips to cudf.pandas FAQ". We can also merge this PR with the question added in PR #16693.
Note: This PR is based on very initial exploration and requires further discussion on:
cudf.pandas
would be self-limiting and inaccurate?

Checklist