perf: faster spatialpandas unique scalar values #6470

maximlt · 2024-12-08T18:14:27Z

While modernizing the NYC Buildings example (holoviz-topics/examples#386), I noticed that creating the HoloViews object (not rendering it) was very slow: the code below took ~35 seconds to run on my machine.

import time
import spatialpandas as spd
import spatialpandas.io
import hvplot.pandas # noqa

sdf = spd.io.read_parquet('/Users/mliquet/dev/examples/nyc_buildings/data/nyc_buildings.parq')
cats = ['unknown'] + list(sdf['type'].value_counts().iloc[:10].index.values)
sdf['type'] = sdf['type'].replace({None: 'unknown'})
sdf = sdf[sdf['type'].isin(cats)]
sdf['type'] = sdf['type'].astype('category')
sdf = sdf.build_sindex()

start = time.perf_counter()
print("Create object")
p = sdf.hvplot.polygons(rasterize=True, groupby='type', aggregator='any', width=600, height=500)
print(f"Obj creation time: {time.perf_counter() - start}")

The new code differs quite a bit from the old code below, which was pure HoloViews. Note that the NYC Buildings example uses a DaskSpatialPandas object, but I saw the same problem with a standard SpatialPandas object.

from collections import OrderedDict
import holoviews as hv
from holoviews.operation.datashader import rasterize

cats = ['unknown'] + list(sdf['type'].value_counts().iloc[:10].index.values)
polys = hv.Polygons(sdf, vdims='type')
hmap  = hv.HoloMap(OrderedDict([(cat, polys.select(type=cat)) for cat in cats]), 'Type', sort=False)
rasterize(hmap, aggregator='any').opts(width=600, height=500)

The main difference is that, internally, hvPlot creates a dynamic HoloViews groupby operation. This operation calls the values method of the interface, which turns out to be very slow for the spatialpandas interface since, AFAIU, it has to collect the values that may be associated with every node of a geometry. That code lives in get_value_array, which loops through every row of the dataset and every node of each geometry, creating intermediate numpy arrays that are finally combined according to how the function was called.
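To illustrate why a per-row loop gets expensive, here is a toy sketch. It is not the actual get_value_array code: the function name expanded_values and its inputs (a node count and a numeric value per row) are assumptions made for illustration only. It repeats each row's value once per geometry node and separates rows with NaN, building one small array per row before concatenating.

```python
import numpy as np

def expanded_values(node_counts, column_values):
    """Hypothetical stand-in for a per-row expansion loop (not HoloViews code)."""
    chunks = []
    for n_nodes, value in zip(node_counts, column_values):
        # One intermediate array per row, repeating the row's value per node.
        chunks.append(np.full(n_nodes, value, dtype=float))
        # NaN separator between consecutive geometries.
        chunks.append(np.array([np.nan]))
    if not chunks:
        return np.array([], dtype=float)
    # Drop the trailing separator and combine everything at the end.
    return np.concatenate(chunks[:-1])

result = expanded_values([2, 3], [1.0, 2.0])
```

With hundreds of thousands of rows, the Python-level loop and the many small allocations dominate, even when the caller only needs the handful of unique values.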

This PR implements a couple of optimizations. The main one, in 3fc8935, skips the loop entirely if all the values in the targeted column are scalar and the function should return unique values (expanded=False). The expanded=True case is a bit more difficult to deal with, as the values must be separated with np.nan, and is out of scope for this PR. With this change, the timing went down to 0.4s, about 90x faster.
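The idea behind the fast path can be sketched roughly as follows. This is a minimal illustration using a plain pandas DataFrame, not the code from 3fc8935: the helper name unique_scalar_values and the convention of returning None to signal "fall back to the slow loop" are assumptions.

```python
import numpy as np
import pandas as pd

def unique_scalar_values(df, column):
    """Hypothetical fast path: return unique values without a per-row loop.

    Returns None when the column holds non-scalar entries, meaning the
    caller should fall back to the slower row-by-row path.
    """
    col = df[column]
    # Non-object dtypes can only hold scalars; object columns need a check.
    if col.dtype != object or col.map(np.isscalar).all():
        return col.unique()
    return None

df = pd.DataFrame({
    "type": ["house", "garage", "house"],
    "nodes": [[0, 1], [2, 3], [4, 5]],  # non-scalar entries
})
fast = unique_scalar_values(df, "type")
slow_needed = unique_scalar_values(df, "nodes")
```

Here the "type" column takes the fast path, while "nodes" returns None because its entries are lists.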

codecov · 2024-12-08T18:32:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.76%. Comparing base (43a0dbf) to head (c3c8f44).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6470      +/-   ##
==========================================
- Coverage   88.76%   88.76%   -0.01%     
==========================================
  Files         323      323              
  Lines       68676    68679       +3     
==========================================
+ Hits        60958    60960       +2     
- Misses       7718     7719       +1


maximlt · 2024-12-08T18:39:54Z

Small change in c3c8f44 to make sure that calling .unique() on a categorical column returns a numpy array instead of a categorical object. Without it, I got an error running the hvPlot code on a Dask spatialpandas object: the objects returned by the function are concatenated, and that step raised an error with the categorical object.
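A small pandas-only sketch of the behaviour described above (the series contents and the two "partitions" are made up for illustration, and this is not the code from c3c8f44): .unique() on a categorical Series returns a pandas Categorical, and np.asarray converts it to a plain NumPy array that concatenates like any other.

```python
import numpy as np
import pandas as pd

s = pd.Series(["house", "garage", "house"], dtype="category")

uniques = s.unique()            # pandas Categorical, not an ndarray
as_array = np.asarray(uniques)  # plain NumPy array of the category values

# Plain arrays from several (hypothetical) partitions concatenate cleanly.
other = np.asarray(pd.Series(["church"], dtype="category").unique())
combined = np.concatenate([as_array, other])
```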

hoxbro

LGTM.

Left one small question.

holoviews/core/data/spatialpandas.py

maximlt added 3 commits December 8, 2024 18:46

return early when no data

56727be

compute geom length only when needed

fe122ca

compute unique scalar values early

3fc8935

handle calling unique on a categorical column

c3c8f44

maximlt changed the title ~~opt: faster spatialpandas unique scalar values~~ perf: faster spatialpandas unique scalar values Dec 8, 2024

This was referenced Dec 8, 2024

NYC_buildings: Modernize notebook holoviz-topics/examples#386

Open

Plan to V2 holoviz-topics/examples#383

Open

hoxbro approved these changes Dec 9, 2024


hoxbro merged commit d586f3f into main Dec 9, 2024
15 of 16 checks passed

hoxbro deleted the opt_spatialpandas_values branch December 9, 2024 17:57
