Fix memory explosion when auto-calculating canvas range extents with dask #717
This PR improves the Canvas autorange logic to avoid bringing each full x/y array into memory in order to compute the min/max values.
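The general idea is sketched below (this is an illustration of the approach, not the exact code in this PR; the dataframe and column names are placeholders). The extents are expressed as lazy dask reductions and computed together, so the workers only ship four scalars back to the client instead of whole columns:

```python
# Minimal sketch: compute x/y extents lazily so dask reduces each partition
# to scalars on the workers rather than pulling the full columns to the client.
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Placeholder dataframe standing in for the real dataset.
df = dd.from_pandas(
    pd.DataFrame({"x": np.random.rand(1_000_000),
                  "y": np.random.rand(1_000_000)}),
    npartitions=8,
)

# Build one lazy graph for all four bounds and evaluate it in a single pass.
(x_range, y_range) = dask.compute(
    (df["x"].min(), df["x"].max()),
    (df["y"].min(), df["y"].max()),
)
print(x_range, y_range)
```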
This fixes #668 for me. Here is the test case I've been using to diagnose and test this PR. I started a Dask distributed instance on a large workstation and persisted the ~3-billion point OSM dataset into memory. After this persist, the Dask dashboard reports ~28GB of memory used. Then I set up a VM with 8GB of RAM and connected it as a client of the distributed scheduler. I then performed a `cvs.points` aggregation without specifying `x_range`/`y_range`, causing the autorange logic to be invoked.

Before these changes, the memory usage of the client would climb steadily until the kernel died. With these changes, the aggregation completes successfully with no noticeable increase in memory usage on the client. A rough reproduction of the test is sketched below.
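The sketch assumes a hypothetical scheduler address, dataset path, and column names; it only illustrates the shape of the test, not the exact script used:

```python
# Hypothetical reproduction of the test above; the scheduler address,
# parquet path, and column names are placeholders.
import datashader as ds
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # run from the low-memory client VM

df = dd.read_parquet("osm.parq")  # ~3-billion point OSM dataset
df = df.persist()                 # data lives in the workers' memory, not the client's

cvs = ds.Canvas(plot_width=900, plot_height=900)
# No x_range/y_range given, so the Canvas autorange logic is exercised.
agg = cvs.points(df, "x", "y")
```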
cc @jacobtomlinson