perf: faster spatialpandas unique scalar values #6470
Merged
While working on modernizing the NYC Buildings example (holoviz-topics/examples#386), I noticed that the creation of a HoloViews object (not the rendering) was very slow: the code below took ~35 seconds to run on my machine.
The new code differs quite a bit from the old one, which was pure HoloViews. Note that the NYC Buildings example uses a DaskSpatialPandas object, but I saw that the problem was also present for a standard SpatialPandas object.
The main difference is that, internally, hvPlot creates a dynamic HoloViews `groupby` operation. This operation calls the `values` method of the interface, and it turns out that this is very slow for the spatialpandas one since, AFAIU, it has to collect the values that may be associated with every node of a geometry. This code is in `get_value_array`, which loops through every row of the dataset and every node of the geometry, creating intermediate numpy arrays that are finally combined based on how the function is called.

This PR implements a couple of optimizations, the main one being in 3fc8935, which skips the loop entirely if all the values in the targeted column are scalar and the function should return unique values (`expanded=False`; `expanded=True` is a bit more difficult to deal with, as the values must be separated with `np.nan`, and is not in the scope of this PR). With this change, the timing went down to 0.4s, about 90x faster.
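
To illustrate the idea behind the fast path, here is a minimal sketch, not the actual HoloViews code. The helpers `slow_unique` and `fast_unique` and the `nodes_per_row` argument are hypothetical names used only for this example: the first mimics the per-row expansion by building one intermediate array per row, and the second skips that work when every value in the column is scalar.

```python
import numpy as np
import pandas as pd


def slow_unique(df, column, nodes_per_row):
    """Hypothetical stand-in for the per-row loop: repeat each row's value
    once per geometry node, concatenate, then take the unique values."""
    arrays = [np.full(n, v) for v, n in zip(df[column], nodes_per_row)]
    return pd.unique(np.concatenate(arrays))


def fast_unique(df, column):
    """Hypothetical fast path: if every value is scalar, the unique values
    can be taken directly from the column. Returns None when the shortcut
    does not apply and the caller must fall back to the loop."""
    col = df[column]
    # Numeric dtypes can only hold scalars; object columns need a check.
    all_scalar = col.dtype.kind in "biuf" or col.map(np.isscalar).all()
    if not all_scalar:
        return None
    return pd.unique(col)


# Made-up data: one value per geometry and an assumed node count per row.
df = pd.DataFrame({"floors": [1, 2, 1, 3]})
nodes_per_row = [3, 2, 4, 1]

print(slow_unique(df, "floors", nodes_per_row))
print(fast_unique(df, "floors"))
```

Both calls return the same unique values in order of first appearance, but the second never allocates the per-row arrays. This only covers the `expanded=False` case described above.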