Add best practices page to Dask cuDF docs #16821

rjzamora · 2024-09-17T19:01:15Z

Description

Adds a much-needed "best practices" page to the Dask cuDF documentation.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rjzamora · 2024-09-17T19:06:57Z

@jacobtomlinson @quasiben @pentschev - Interested in your feedback on the specific "best practices" guidelines I added here. Happy to revise.

rjzamora · 2024-09-17T19:07:50Z

docs/dask_cudf/source/best_practices.rst

+Deployment and Configuration
+----------------------------
+
+Use Dask DataFrame Directly


Not sure what "section" this belongs to.

Maybe just move it to the end of the Deployment and Configuration section? The rationale is that, placed at the top, this may look like a suggestion, which is actually the opposite of what this subsection says. No strong opinions though.

pentschev

Looks very good to me, although I'm far from experienced with Dask cuDF best practices. I've left a few suggestions that I hope will be useful, but this looks great as is! Thanks @rjzamora.

docs/dask_cudf/source/best_practices.rst

pentschev · 2024-09-17T19:49:21Z

docs/dask_cudf/source/best_practices.rst

+The ideal partition size is typically between 2-10% of the memory capacity
+of a single GPU. Increasing the partition size will typically reduce the


Should we provide here a rule-of-thumb as to whether users should initially target more to the 2% or the 10% range, and how/when to increase/decrease that? Or is this too difficult to provide a good rule-of-thumb and the 2-10% phrasing is the best we can do? I understand it can be quite difficult to give more details for general purpose docs, so it's fine if you think the current phrasing is sufficient/best.

Okay, I attempted to turn this into the more-explicit "rule of thumb" I personally use: 1/16 or less if the workflow is memory-intensive (i.e. shuffle intensive), and 1/8 otherwise. The "best" partition size is definitely difficult to know a priori.
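The rule of thumb in this reply can be written down as a small helper. This is only a sketch and not part of the PR: the name `target_partition_bytes` and the `shuffle_intensive` flag are hypothetical, and the GPU memory capacity is passed in by hand rather than queried from the device.

```python
GiB = 2**30


def target_partition_bytes(gpu_memory_bytes: int, shuffle_intensive: bool) -> int:
    """Return an upper bound on partition size for one GPU.

    Uses 1/16 of GPU memory for memory-intensive (shuffle-heavy)
    workflows and 1/8 otherwise, as described in the reply above.
    """
    divisor = 16 if shuffle_intensive else 8
    return gpu_memory_bytes // divisor


# Example: compare both cases for a 16 GiB and an 80 GiB device.
for capacity in (16 * GiB, 80 * GiB):
    heavy = target_partition_bytes(capacity, shuffle_intensive=True)
    light = target_partition_bytes(capacity, shuffle_intensive=False)
    print(f"{capacity // GiB} GiB GPU: {heavy / GiB:.1f} GiB or less (shuffle), "
          f"{light / GiB:.1f} GiB (otherwise)")
```

Treat the result as a starting point; as the reply notes, the best partition size is hard to know a priori.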

pentschev · 2024-09-17T19:51:19Z

docs/dask_cudf/source/best_practices.rst

+
+``blocksize``: Use this argument to specify the maximum partition size.
+The default is `"256 MiB"`, but larger values are usually more performant
+(e.g. `1 GiB` is usually safe). Dask will use the ``blocksize`` value to map


Maybe provide a guideline for when 1 GiB is safe. I imagine it is safe for the large devices we usually work with, but given the recent NO-OOM effort, I don't think a small laptop GPU will be able to handle 1 GiB safely.

I decided to remove the 1 GiB comment since we already discuss the 1/16-1/8 "rule of thumb" above.
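For readers who want to tie `blocksize` to a per-GPU size target instead of a fixed value, a sketch might look like the following. It assumes the hunk above documents the `blocksize` argument of `dask_cudf.read_parquet`; the helpers `format_blocksize` and `read_with_target` are hypothetical names, and the `dask_cudf` import is deferred because it needs a CUDA-capable GPU.

```python
def format_blocksize(nbytes: int) -> str:
    """Render a byte count as a whole-MiB string such as "256MiB"."""
    mib = max(1, nbytes // 2**20)
    return f"{mib}MiB"


def read_with_target(path: str, partition_bytes: int):
    """Read a Parquet dataset with `blocksize` derived from a byte target.

    Assumption: `dask_cudf.read_parquet` accepts `blocksize` as a byte
    string, as the quoted documentation hunk suggests.
    """
    import dask_cudf  # deferred: requires a GPU environment

    return dask_cudf.read_parquet(path, blocksize=format_blocksize(partition_bytes))


# Pure-Python part only; no GPU needed to run this line.
print(format_blocksize(256 * 2**20))
```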

pentschev · 2024-09-17T19:53:21Z

docs/dask_cudf/source/best_practices.rst

+  many of the details discussed in the `Dask DataFrames Best Practices
+  <https://docs.dask.org/en/stable/dataframe-best-practices.html>`__
+  documentation also apply to Dask cuDF.


Are there any notable ones that are known NOT to apply to Dask cuDF and we should let users know here?

I updated the wording to say the guidelines that are not pandas-specific also apply to Dask cuDF.

pentschev · 2024-09-17T19:56:33Z

docs/dask_cudf/source/best_practices.rst

+Deployment and Configuration
+----------------------------
+
+Use Dask DataFrame Directly


Maybe just move it to the end of the Deployment and Configuration section? The rationale is that, placed at the top, this may look like a suggestion, which is actually the opposite of what this subsection says. No strong opinions though.

pentschev · 2024-09-17T19:58:52Z

docs/dask_cudf/source/best_practices.rst

+Sorting, joining and grouping operations all have the potential to
+require the global shuffling of data between distinct partitions.


This also applies to repartition, no? Maybe we should mention the same arguments here, or point https://github.com/rapidsai/cudf/pull/16821/files#diff-1c3b287013ea5f3b56726f0b7e7538bd0242e8cadcbd89e966a82d3d36719317R93-R95 to here as well, to make it more explicit what users are dealing with in those cases.

Hmmm. Repartition does not really require data "shuffling". Data shuffling requires "all-to-all", while repartitioning is usually limited to data movement between neighboring partitions.
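The distinction drawn in this reply can be illustrated with plain Python lists standing in for partitions. This is a toy model under stated assumptions, not how Dask implements either operation: `hash_shuffle` and `coalesce_neighbors` are hypothetical names, and each function also records which input partitions fed each output partition.

```python
def hash_shuffle(partitions, npartitions_out):
    """All-to-all: route every row by key, so any input can feed any output."""
    out = [[] for _ in range(npartitions_out)]
    sources = [set() for _ in range(npartitions_out)]
    for i, part in enumerate(partitions):
        for key in part:
            j = key % npartitions_out  # toy "hash" of an integer key
            out[j].append(key)
            sources[j].add(i)
    return out, sources


def coalesce_neighbors(partitions, group_size):
    """Repartition by concatenating adjacent partitions only."""
    out, sources = [], []
    for start in range(0, len(partitions), group_size):
        chunk = partitions[start:start + group_size]
        out.append([key for part in chunk for key in part])
        sources.append(set(range(start, start + len(chunk))))
    return out, sources


parts = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
_, shuffle_sources = hash_shuffle(parts, 2)
_, repartition_sources = coalesce_neighbors(parts, 2)
print("shuffle sources:", shuffle_sources)
print("repartition sources:", repartition_sources)
```

In the shuffle, each output partition draws from all four inputs; in the neighbor-based repartition, each output draws from only two adjacent inputs.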

pentschev · 2024-09-17T20:00:10Z

docs/dask_cudf/source/best_practices.rst

+* Use a distributed cluster with Dask-CUDA workers
+* Use native cuDF spilling whenever possible


Should we link here to the sections above that deal with these two suggestions, for the benefit of readers who come directly to this section? Not sure if that is easy or even possible; if not, skipping it is also fine.

Yeah, good idea. Will need to figure out how to do that :)
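For context, the two bullets under discussion could be combined in a setup along these lines. This is a hedged sketch: `LocalCUDACluster` and `Client` are real entry points in Dask-CUDA and Distributed, but the `enable_cudf_spill` keyword is assumed to be available in the installed Dask-CUDA release, and the helper names are hypothetical. Imports are deferred because starting the cluster requires a GPU.

```python
def spill_cluster_kwargs(enable_cudf_spill: bool = True) -> dict:
    """Keyword arguments for a Dask-CUDA cluster with native cuDF spilling.

    Assumption: the installed Dask-CUDA release accepts `enable_cudf_spill`.
    """
    return {"enable_cudf_spill": enable_cudf_spill}


def start_client():
    """Start one Dask-CUDA worker per visible GPU and connect a client."""
    from dask_cuda import LocalCUDACluster  # deferred: requires a GPU
    from distributed import Client

    cluster = LocalCUDACluster(**spill_cluster_kwargs())
    return Client(cluster)


# Inspecting the arguments does not require a GPU.
print(spill_cluster_kwargs())
```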

rjzamora · 2024-09-18T18:44:22Z

@VibhuJawa @ayushdg @randerzander - I would consider you all dask-cudf power users. Let me know if these "best practices" seem reasonable to you.

ayushdg

Thanks a lot for this! I found the page very helpful overall. Left a few comments and nits

docs/dask_cudf/source/best_practices.rst

ayushdg · 2024-09-18T19:35:21Z

docs/dask_cudf/source/best_practices.rst

+``False``, but ``aggregate_files=True`` is usually more performant when
+the dataset contains many files that are smaller than half of ``blocksize``.
+
+.. note::


I like this note. Once we have more cloud IO specific optimizations it might make sense to add it to best practices or create a new one for cloud IO to discuss tips/tricks for those environments.

Agree that we need a lot more remote-IO information. However, it doesn't feel like there is much to say yet :/
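The `aggregate_files` guidance quoted in the hunk above (prefer `True` when the dataset contains many files smaller than half of `blocksize`) can be turned into a quick check over known file sizes. This is a sketch only: the function names are hypothetical, and reading "many" as "at least half of the files" is an assumption made here, not something the documentation states.

```python
def small_file_fraction(file_sizes, blocksize_bytes):
    """Fraction of files that are smaller than half of `blocksize_bytes`."""
    if not file_sizes:
        return 0.0
    small = sum(1 for size in file_sizes if size < blocksize_bytes / 2)
    return small / len(file_sizes)


def suggest_aggregate_files(file_sizes, blocksize_bytes, threshold=0.5):
    """Suggest aggregate_files=True when small files dominate.

    `threshold` is an illustrative cutoff for "many"; tune it per dataset.
    """
    return small_file_fraction(file_sizes, blocksize_bytes) >= threshold


MiB = 2**20
sizes = [10 * MiB] * 8 + [300 * MiB] * 2
print(suggest_aggregate_files(sizes, 256 * MiB))
```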

docs/dask_cudf/source/best_practices.rst

jacobtomlinson

This looks great. I added a few general thoughts.

docs/dask_cudf/source/best_practices.rst

VibhuJawa

Doc mostly looks great to me, thanks for adding it. Have left some small notes

docs/dask_cudf/source/best_practices.rst

docs/dask_cudf/source/best_practices.rst

rjzamora · 2024-09-20T14:17:22Z

I will plan to merge this in a few hours if there aren't any more comments.

wence-

Looks great, I only had a few tiny wording nits

docs/dask_cudf/source/best_practices.rst

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

rjzamora · 2024-09-20T17:04:54Z

/merge

rjzamora added 3 commits September 16, 2024 15:21

start best practices page for dask-cudf

f01fd71

revisions

7aa8041

Merge remote-tracking branch 'upstream/branch-24.10' into dask-cudf-best-practices

b2ce634

rjzamora added the 2 - In Progress, doc, and non-breaking labels Sep 17, 2024

rjzamora self-assigned this Sep 17, 2024

github-actions bot added the Python label Sep 17, 2024

rjzamora marked this pull request as ready for review September 17, 2024 19:05

rjzamora requested a review from a team as a code owner September 17, 2024 19:05

rjzamora commented Sep 17, 2024

pentschev approved these changes Sep 17, 2024

rjzamora added 2 commits September 17, 2024 17:54

address code review

1e028ea

more revisions

3332717

rjzamora added the 3 - Ready for Review label and removed the 2 - In Progress label Sep 18, 2024

ayushdg reviewed Sep 18, 2024

rjzamora added 8 commits September 18, 2024 14:54

more revisions

eee37f3

Merge remote-tracking branch 'upstream/branch-24.10' into dask-cudf-best-practices

7c63c7e

add from_map note on meta

a425405

add note on diagnostics

9233524

fix typos

bd144c2

tweak wording

5f854e7

Merge remote-tracking branch 'upstream/branch-24.10' into dask-cudf-best-practices

397efa7

fix map_partitions typo

6c8771b

jacobtomlinson reviewed Sep 19, 2024

rjzamora added 2 commits September 19, 2024 07:05

revisions

f7731b8

Merge remote-tracking branch 'upstream/branch-24.10' into dask-cudf-best-practices

581a69f

rjzamora added 2 commits September 19, 2024 08:33

fix spelling error and add link to quick-start example

8515cb9

replace link to readme

a23deff

ayushdg mentioned this pull request Sep 19, 2024

Add OOM section to Best Practices NVIDIA/NeMo-Curator#244

Merged

VibhuJawa reviewed Sep 19, 2024

rjzamora added 4 commits September 19, 2024 16:26

Merge remote-tracking branch 'upstream/branch-24.10' into dask-cudf-best-practices

4c1b55d

add a bit more info about wait and CPU-GPU data movement

8ecd536

Merge branch 'branch-24.10' into dask-cudf-best-practices

251bf23

update

40a638e

rjzamora commented Sep 20, 2024

rjzamora commented Sep 20, 2024

Apply suggestions from code review

d082cac

rjzamora commented Sep 20, 2024

rjzamora added 6 commits September 20, 2024 08:41

Apply suggestions from code review

8152fca

Merge remote-tracking branch 'upstream/branch-24.10' into dask-cudf-best-practices

a653a5a

fix lists

91d4fd5

fix func list

d58a5ce

roll back func change

59e597a

fix more double-colon mistakes

adbd22d

Merge branch 'branch-24.10' into dask-cudf-best-practices

216d5de

wence- approved these changes Sep 20, 2024

Apply suggestions from code review

d76dbd6

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

rjzamora added the 5 - Ready to Merge label and removed the 3 - Ready for Review label Sep 20, 2024

Merge branch 'branch-24.10' into dask-cudf-best-practices

da7308a

rapids-bot bot merged commit b165210 into rapidsai:branch-24.10 Sep 20, 2024
96 checks passed

rjzamora deleted the dask-cudf-best-practices branch September 21, 2024 03:25
