PERF: non-numeric fillna #20300

TomAugspurger · 2018-03-12T14:19:56Z

Split from a35f93c

# ffill / bfill
# The basic idea is to create an array of integer positions.
# Internally, we use iNaT and the datetime filling routines
# to avoid floating-point NaN. Once filled, we take on `self`
# to get the actual values.
func = pad_1d if method == 'pad' else backfill_1d
idx = np.arange(len(self), dtype='int64')
idx[mask] = iNaT
idx = _ensure_platform_int(func(idx, mask=mask,
                                limit=limit,
                                dtype='datetime64[ns]'))
idx[idx == iNaT] = -1  # missing value marker for take.
new_values = self.take(idx)

self is a non-numeric array-like thing.

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2018-03-12T14:30:12Z

What also might be useful, thinking about the case of geopandas: somehow an advanced but officially public fillna function that works on numeric data. In geopandas I have integer data under the hood I could forward or backward fill or fill with a value. So inside a GeometryArray.fillna method I could pass those data to such a fillna function together with the specified method/value/limit keywords, and then wrap the result of that in again into GeometryArray before returning.
Similar pattern could be useful for PeriodArray.

WillAyd · 2018-03-27T06:13:27Z

Might be missing the larger context here but I recently implemented a Cythonized fill method for groupby objects that writes an array of indexes.

pandas/pandas/_libs/groupby.pyx

Line 270 in c36e9f7

def group_fillna_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,

A subsequent call to algorithms.take_nd allows you to get backed "filled" values in a performant way agnostic of the object type - is that in the direction of what you are looking for here?

TomAugspurger · 2018-03-27T11:27:02Z

Yes, I think that looks about right, though I think it there would need to be an option to disable sorting for it to be usable outside a groupby context.

WillAyd · 2018-03-27T14:21:16Z

The sorting in the GroupBy implementation is only required to ensure that group items are evaluated together. A non-groupby implementation would be simpler and to your point not even not the sort that's in that code

TomAugspurger · 2018-03-28T14:50:11Z

Gotcha. Do you think we could share an implementation between groupby and non-groupby? Is sorting the only difference?

…

On Tue, Mar 27, 2018 at 9:21 AM, William Ayd ***@***.***> wrote: The sorting in the GroupBy implementation is only required to ensure that group items are evaluated together. A non-groupby implementation would be simpler and to your point not even not the sort that's in that code — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20300 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIrUb8Vg8APoiH1NVQ04BCSB-pRO_ks5tikrigaJpZM4SmuSG> .

WillAyd · 2018-03-28T22:52:57Z

I don't think they are that far off. In fact (though rather verbose) you could hack the groupby implementation as is to use with a Series

In [25]: from pandas._libs import groupby as libgroupby
In [25]: import pandas.core.algorithms as algorithms
In [25]: ser = pd.Series([1., np.nan, np.nan, 2, np.nan, 3, np.nan])
In [25]: idxes = np.zeros_like(ser, dtype=np.int64)  # positions to pass later to take
In [25]: dummy_labs = np.zeros_like(ser, dtype=np.int64)  # just need the same val for all items
In [25]: mask = np.isnan(ser.values).view(np.uint8)  # mask of NA vals
In [25]: libgroupby.group_fillna_indexer(idxes, dummy_labs, mask, 'ffill', -1)

In [25]: idxes
Out[25]: array([0, 0, 0, 3, 3, 5, 5])

In [27]: algorithms.take_nd(ser.values, idxes)
Out[27]: array([1., 1., 1., 2., 2., 3., 3.])

WillAyd · 2018-03-29T01:35:33Z

I don't want to open a pandora's box with this comment but this also has me thinking about our general approach to GroupBy and Series/DataFrame algorithms. The former tend to be copy/paste of the latter with a slight twist to account for Group labels, but I'm wondering if there's not a way for really all of the algorithms to be shared, with the Series/DataFrame ones providing either None or an empty array for Group labels.

It's certainly no small undertaking, but it would:

Decrease technical debt, by only having on reference implementation and less duplication
Decrease behavior nuances and inconsistent keyword support across the various types (ref FIX: add support for desc order when ranking infs with nans #19538 #20091)
Assumedly reduce distribution size from having a smaller code base (?)

Food for thought

jorisvandenbossche · 2018-03-29T08:50:37Z

Not about the bigger issue you raise, but about this specific algorithm. It seems that the existing pad_1d algorithm (what Tom posted in the top post) is somewhat faster as the groupby one:

In [36]: ser = pd.Series(np.arange(100000))

In [37]: ser[np.random.randint(0, 100000, 10000)] = np.nan

In [38]: %%timeit
    ...: idx = np.arange(len(ser), dtype='int64')
    ...: mask = np.isnan(ser.values)
    ...: pad_1d(idx, mask=mask, dtype='datetime64[ns]')
    ...: idx[idx == iNaT] = -1
    ...: 
435 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [39]: %%timeit
    ...: idxes = np.zeros_like(ser, dtype=np.int64)  # positions to pass later to take
    ...: dummy_labs = np.zeros_like(ser, dtype=np.int64)  # just need the same val for all items
    ...: mask = np.isnan(ser.values).view(np.uint8)  # mask of NA vals
    ...: libgroupby.group_fillna_indexer(idxes, dummy_labs, mask, 'ffill', -1)
    ...: 
1.81 ms ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(of course I don't know how much overhead there is by the fact it needs to do something with the grouping. But just saying that for the original raised issue, we actually already have all cython infrastructure available, it is more making a higher level function wrapping the above, to make the functionality more easily available)

WillAyd · 2018-03-29T15:05:34Z

@jorisvandenbossche that's interesting, though I'd wonder if just the algos implementation is coded more efficiently and some of that could be ported over to the groupby. I can't imagine the labels (since they are all missing anyway) should add that much overhead, but that would have to be profiled in more detail

TomAugspurger · 2018-12-18T15:06:10Z

DatetimeLikeArrayMixin would benefit from this (or a _values_for_fillna) as well.

TomAugspurger mentioned this issue Mar 12, 2018

ENH: ExtensionArray.fillna #19909

Merged

TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Difficulty Intermediate labels Mar 12, 2018

TomAugspurger mentioned this issue Jan 1, 2019

REF: DatetimeLikeArray #24024

Merged

12 tasks

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke removed the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: non-numeric fillna #20300

PERF: non-numeric fillna #20300

TomAugspurger commented Mar 12, 2018

jorisvandenbossche commented Mar 12, 2018

WillAyd commented Mar 27, 2018

TomAugspurger commented Mar 27, 2018

WillAyd commented Mar 27, 2018

TomAugspurger commented Mar 28, 2018 via email

WillAyd commented Mar 28, 2018

WillAyd commented Mar 29, 2018 •

edited

Loading

jorisvandenbossche commented Mar 29, 2018

WillAyd commented Mar 29, 2018

TomAugspurger commented Dec 18, 2018

PERF: non-numeric fillna #20300

PERF: non-numeric fillna #20300

Comments

TomAugspurger commented Mar 12, 2018

jorisvandenbossche commented Mar 12, 2018

WillAyd commented Mar 27, 2018

TomAugspurger commented Mar 27, 2018

WillAyd commented Mar 27, 2018

TomAugspurger commented Mar 28, 2018 via email

WillAyd commented Mar 28, 2018

WillAyd commented Mar 29, 2018 • edited Loading

jorisvandenbossche commented Mar 29, 2018

WillAyd commented Mar 29, 2018

TomAugspurger commented Dec 18, 2018

WillAyd commented Mar 29, 2018 •

edited

Loading