-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: non-numeric fillna #20300
Comments
What also might be useful, thinking about the case of geopandas: somehow an advanced but officially public |
Might be missing the larger context here but I recently implemented a Cythonized fill method for groupby objects that writes an array of indexes. pandas/pandas/_libs/groupby.pyx Line 270 in c36e9f7
A subsequent call to |
Yes, I think that looks about right, though I think it there would need to be an option to disable sorting for it to be usable outside a groupby context. |
The sorting in the GroupBy implementation is only required to ensure that group items are evaluated together. A non-groupby implementation would be simpler and to your point not even not the sort that's in that code |
Gotcha. Do you think we could share an implementation between groupby and
non-groupby? Is sorting the only difference?
…On Tue, Mar 27, 2018 at 9:21 AM, William Ayd ***@***.***> wrote:
The sorting in the GroupBy implementation is only required to ensure that
group items are evaluated together. A non-groupby implementation would be
simpler and to your point not even not the sort that's in that code
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#20300 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIrUb8Vg8APoiH1NVQ04BCSB-pRO_ks5tikrigaJpZM4SmuSG>
.
|
I don't think they are that far off. In fact (though rather verbose) you could hack the groupby implementation as is to use with a Series In [25]: from pandas._libs import groupby as libgroupby
In [25]: import pandas.core.algorithms as algorithms
In [25]: ser = pd.Series([1., np.nan, np.nan, 2, np.nan, 3, np.nan])
In [25]: idxes = np.zeros_like(ser, dtype=np.int64) # positions to pass later to take
In [25]: dummy_labs = np.zeros_like(ser, dtype=np.int64) # just need the same val for all items
In [25]: mask = np.isnan(ser.values).view(np.uint8) # mask of NA vals
In [25]: libgroupby.group_fillna_indexer(idxes, dummy_labs, mask, 'ffill', -1)
In [25]: idxes
Out[25]: array([0, 0, 0, 3, 3, 5, 5])
In [27]: algorithms.take_nd(ser.values, idxes)
Out[27]: array([1., 1., 1., 2., 2., 3., 3.]) |
I don't want to open a pandora's box with this comment but this also has me thinking about our general approach to GroupBy and Series/DataFrame algorithms. The former tend to be copy/paste of the latter with a slight twist to account for Group labels, but I'm wondering if there's not a way for really all of the algorithms to be shared, with the Series/DataFrame ones providing either None or an empty array for Group labels. It's certainly no small undertaking, but it would:
Food for thought |
Not about the bigger issue you raise, but about this specific algorithm. It seems that the existing
(of course I don't know how much overhead there is by the fact it needs to do something with the grouping. But just saying that for the original raised issue, we actually already have all cython infrastructure available, it is more making a higher level function wrapping the above, to make the functionality more easily available) |
@jorisvandenbossche that's interesting, though I'd wonder if just the algos implementation is coded more efficiently and some of that could be ported over to the groupby. I can't imagine the labels (since they are all missing anyway) should add that much overhead, but that would have to be profiled in more detail |
DatetimeLikeArrayMixin would benefit from this (or a |
Split from a35f93c
self is a non-numeric array-like thing.
The text was updated successfully, but these errors were encountered: