Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: faster spatialpandas unique scalar values #6470

Merged
merged 4 commits into from
Dec 9, 2024

Conversation

maximlt
Copy link
Member

@maximlt maximlt commented Dec 8, 2024

While working on modernizing the NYC Buildings example (holoviz-topics/examples#386), I noticed that the creation of a HoloViews object (not the rendering) was very slow, the code below took ~35 seconds to run on my machine.

import time
import spatialpandas as spd
import spatialpandas.io
import hvplot.pandas # noqa

sdf = spd.io.read_parquet('/Users/mliquet/dev/examples/nyc_buildings/data/nyc_buildings.parq')
cats = ['unknown'] + list(sdf['type'].value_counts().iloc[:10].index.values)
sdf['type'] = sdf['type'].replace({None: 'unknown'})
sdf = sdf[sdf['type'].isin(cats)]
sdf['type'] = sdf['type'].astype('category')
sdf = sdf.build_sindex()

start = time.perf_counter()
print("Create object")
p = sdf.hvplot.polygons(rasterize=True, groupby='type', aggregator='any', width=600, height=500)
print(f"Obj creation time: {time.perf_counter() - start}")

The new code differs quite a bit from the old one, that was pure HoloViews. Note that the NYC Buildings example uses a DaskSpatialPandas object, I saw the problem was also present for a standard SpatialPandas object.

cats = ['unknown'] + list(sdf['type'].value_counts().iloc[:10].index.values)
polys = hv.Polygons(sdf, vdims='type')
hmap  = hv.HoloMap(OrderedDict([(cat, polys.select(type=cat)) for cat in cats]), 'Type', sort=False)
rasterize(hmap, aggregator='any').opts(width=600, height=500)

The main difference is that, internally, hvPlot creates a dynamic HoloViews groupby operation. This operation calls the values method of the interface, and it turns out that this is very slow for the spatialpandas one, since AFAIU, it has to deal with collecting the values that may be associated with every node of a geometry. This code is in get_value_array that loops through every row of the dataset, and every node of the geometry, and creates intermediate numpy arrays that are finally combined based on how the function is called.

This PR implements a couple of optimizations, the main one being in 3fc8935 that skips the loop entirely if all the values in the targeted column are scalar and if the function should return unique values (expanded=False, expanded=True is a bit more difficult o deal with as the values must be separated with np.nan, and not the scope of this PR). With this change, the timing went down to 0.4s, about 90x faster.

Copy link

codecov bot commented Dec 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.76%. Comparing base (43a0dbf) to head (c3c8f44).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6470      +/-   ##
==========================================
- Coverage   88.76%   88.76%   -0.01%     
==========================================
  Files         323      323              
  Lines       68676    68679       +3     
==========================================
+ Hits        60958    60960       +2     
- Misses       7718     7719       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@maximlt
Copy link
Member Author

maximlt commented Dec 8, 2024

Small change in c3c8f44 to make sure calling .unique() on a categorical column returns a numpy array instead of a categorical object. Without that, I got an error running the hvplot code from a Dask spatialpandas object, as the object returned by the function are concatenated and that part raised an error with the categorical object.

@maximlt maximlt changed the title opt: faster spatialpandas unique scalar values perf: faster spatialpandas unique scalar values Dec 8, 2024
Copy link
Member

@hoxbro hoxbro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Left one small question.

holoviews/core/data/spatialpandas.py Show resolved Hide resolved
@hoxbro hoxbro merged commit d586f3f into main Dec 9, 2024
15 of 16 checks passed
@hoxbro hoxbro deleted the opt_spatialpandas_values branch December 9, 2024 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants