Adding an FAQ on recommended range of dataset size for cudf.pandas #16869

Closed
docs/cudf/source/cudf_pandas/faq.md (11 additions, 0 deletions)

@@ -194,3 +194,14 @@
for testing or benchmarking purposes. To do so, set the `CUDF_PANDAS_FALLBACK_MODE` environment variable:
```bash
CUDF_PANDAS_FALLBACK_MODE=1 python -m cudf.pandas some_script.py
```

## What is the recommended range of dataset sizes that I can process using `cudf.pandas`?

`cudf.pandas` can process a wide range of dataset sizes. As a
_very rough_ rule of thumb, `cudf.pandas` shines on workflows with more than
10,000 to 100,000 rows of data, depending on the algorithms, data types, and
other factors. Below this range, workflows may run slower on the GPU than on
the CPU because of the cost of data transfers. With the [managed memory pool
and managed memory prefetching enabled in cudf by default](how-it-works.md),
you can process datasets larger than GPU memory, up to a theoretical limit of
the combined CPU and GPU memory size. Note, however, that the best performance
at large data sizes is data- and workflow-dependent.
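To get a feel for where the crossover point lies on your own hardware, a small
benchmark sketch such as the one below can help. The file name, row count, and
workload here are illustrative assumptions, not recommendations; run the script
once with plain `python` for the CPU baseline and once with
`python -m cudf.pandas` for the accelerated run, then compare the timings.

```python
# benchmark.py -- a minimal, illustrative sketch (names and sizes are
# hypothetical). Run `python benchmark.py` for the CPU baseline and
# `python -m cudf.pandas benchmark.py` for the GPU-accelerated run.
import time

import numpy as np
import pandas as pd

n_rows = 10_000_000  # adjust up or down to probe the crossover point

# Build a synthetic dataset with a low-cardinality key column.
df = pd.DataFrame(
    {
        "key": np.random.randint(0, 1_000, n_rows),
        "value": np.random.rand(n_rows),
    }
)

# Time a representative aggregation.
start = time.perf_counter()
result = df.groupby("key")["value"].mean()
print(f"groupby-mean over {n_rows:,} rows: {time.perf_counter() - start:.3f}s")
```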