PERF: Cythonize Groupby Rank #19481
Conversation
Hello @WillAyd! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on February 09, 2018 at 18:37 UTC
pandas/_libs/groupby_helper.pxi.in
Outdated
                ndarray[{{c_type}}, ndim=2] values,
                ndarray[int64_t] labels,
                bint is_datetimelike, **kwargs):
    """
looks ok. don't pass kwargs around generally (esp. in cython); we want to specify the args directly. this may make the call slightly trickier in cython_transform (e.g. you may need to use a partial or lambda to hold the additional args)
Still not complete, but sharing in the interim for any code review since this change is large. At a minimum I need to go back and use something besides kwargs to pass the appropriate values back.
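As a rough illustration of the partial/lambda suggestion (names and call shape here are assumptions, not the merged implementation): functools.partial can bind the rank-specific arguments up front so the Cython function keeps an explicit signature while the generic transform machinery still sees a uniform callable.

from functools import partial

def group_rank(values, labels, is_datetimelike,
               ties_method, ascending, pct, na_option):
    # stand-in for the Cython function with explicit arguments (no **kwargs)
    ...

# bind the extra args once; cython_transform-style code can then call
# func(values, labels, is_datetimelike) without knowing about rank
func = partial(group_rank, ties_method='average', ascending=True,
               pct=False, na_option='keep')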
pandas/_libs/groupby.pyx
Outdated
                    out[_as[j], 0] = i - grp_start + 1
            elif tiebreak == TIEBREAK_FIRST:
                for j in range(i - dups + 1, i + 1):
                    if ascending:
This could arguably also be done using the TIEBREAK_FIRST_DESCENDING flag, but I figured it was worth just using TIEBREAK_FIRST and adding a conditional given the former is what is being passed
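For readers unfamiliar with the flag: ties_method='first' breaks ties by order of appearance, and descending order reverses the assignment within a tied run, which is what the extra conditional covers. A quick illustration with assumed sample data:

import pandas as pd

s = pd.Series([1, 1, 2])
print(s.rank(method='first', ascending=True).tolist())   # [1.0, 2.0, 3.0]
print(s.rank(method='first', ascending=False).tolist())  # [2.0, 3.0, 1.0]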
pandas/_libs/groupby.pyx
Outdated
            grp_na_count += 1
            out[_as[i], 0] = np.nan
        else:
            if tiebreak == TIEBREAK_AVERAGE:
In general this looping mechanism isn't very efficient because it continually overwrites "duplicate" values in a loop. Given the benchmarks were still significantly faster I left it as is and was planning to open a separate change to optimize further, but I could revisit that as part of this PR if you feel this loop mechanism is not acceptable.
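A pure-Python sketch of the pattern being described (not the Cython code itself): while walking a sorted group, the ranks of the current tied run are rewritten on every iteration, which is correct but does redundant work proportional to the run length.

def average_ranks(sorted_vals):
    out = [0.0] * len(sorted_vals)
    dups = sum_ranks = 0
    for i, val in enumerate(sorted_vals):
        dups += 1
        sum_ranks += i + 1
        # overwrite the entire tied run each pass; only the last pass
        # through the run leaves the final (correct) average rank
        for j in range(i - dups + 1, i + 1):
            out[j] = sum_ranks / dups
        if i == len(sorted_vals) - 1 or sorted_vals[i + 1] != val:
            dups = sum_ranks = 0
    return out

print(average_ranks([1, 1, 2]))  # [1.5, 1.5, 3.0]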
pandas/_libs/groupby_helper.pxi.in
Outdated
            dups += 1
            sum_ranks += i - grp_start + 1

            if keep_na and masked_vals[_as[i]] == nan_fill_val:
This comparison to the nan_fill_val gets a little tricky because it obfuscates np.nan with np.inf (or whatever fill value is being used). That said, that is a pre-existing limitation which I opened #19538 to address.
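To make the limitation concrete (illustrative values assumed): once NaNs are masked with a fill value like np.inf, a genuine infinity and a filled NaN become indistinguishable to the keep_na comparison.

import numpy as np

vals = np.array([1.0, np.inf, np.nan])
mask = np.isnan(vals)
nan_fill_val = np.inf  # chosen so NaNs sort last when ascending
masked_vals = np.where(mask, nan_fill_val, vals)
print(masked_vals == nan_fill_val)  # [False  True  True] -- real inf collides with NaN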
pandas/_libs/groupby.pyx
Outdated
    ascending = kwargs['ascending']
    pct = kwargs['pct']
    keep_na = kwargs['na_option'] == 'keep'
    N, K = (<object> values).shape
Due to my limited understanding I am not using the K value that gets extracted here, as I couldn't figure out under what circumstance K was ever not equal to 0. Can you advise how that works or what to look at to help my comprehension?
Benchmarks provided below for reference:

       before           after         ratio
     [f391cbfe]       [27cb6b57]
+ 139.69μs 535.43μs 3.83 groupby.GroupByMethods.time_method('float', 'cummin')
+ 142.06μs 522.73μs 3.68 groupby.GroupByMethods.time_method('float', 'cummax')
+ 145.02μs 527.11μs 3.63 groupby.GroupByMethods.time_method('int', 'cummin')
+ 162.18μs 574.63μs 3.54 groupby.GroupByMethods.time_method('float', 'cumsum')
+ 145.86μs 512.28μs 3.51 groupby.GroupByMethods.time_method('int', 'cummax')
+ 159.64μs 533.92μs 3.34 groupby.GroupByMethods.time_method('int', 'cumsum')
+ 311.50μs 728.59μs 2.34 groupby.GroupByMethods.time_method('float', 'cumprod')
+ 315.28μs 650.14μs 2.06 groupby.GroupByMethods.time_method('int', 'cumprod')
+ 872.00μs 1.24ms 1.42 groupby.GroupByMethods.time_method('int', 'sem')
+ 901.00μs 1.21ms 1.34 groupby.GroupByMethods.time_method('float', 'sem')
+ 161.36μs 203.44μs 1.26 groupby.GroupByMethods.time_method('int', 'min')
+ 200.87ms 235.93ms 1.17 groupby.GroupByMethods.time_method('float', 'all')
+ 64.93μs 75.73μs 1.17 groupby.GroupByMethods.time_method('int', 'size')
+ 299.42μs 345.95μs 1.16 groupby.GroupByMethods.time_method('int', 'std')
+ 201.29ms 231.93ms 1.15 groupby.GroupByMethods.time_method('float', 'any')
+ 579.50ms 659.22ms 1.14 groupby.GroupByMethods.time_method('int', 'pct_change')
+ 2.84s 3.19s 1.12 groupby.GroupByMethods.time_method('float', 'describe')
+ 192.43μs 215.57μs 1.12 groupby.GroupByMethods.time_method('int', 'cumcount')
+ 319.90μs 356.71μs 1.12 groupby.GroupByMethods.time_method('int', 'median')
+ 156.03μs 173.91μs 1.11 groupby.GroupByMethods.time_method('float', 'min')
+ 121.26μs 134.29μs 1.11 groupby.GroupByMethods.time_method('int', 'shift')
+ 173.37μs 191.73μs 1.11 groupby.GroupByMethods.time_method('float', 'var')
- 193.48ms 795.71μs 0.00 groupby.GroupByMethods.time_method('int', 'rank')
- 301.72ms 787.03μs 0.00 groupby.GroupByMethods.time_method('float', 'rank')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
pandas/tests/groupby/test_groupby.py
Outdated
    ])
    def test_rank_args(self, grps, vals, ties_method, ascending, pct, exp):
        if ties_method == 'first' and vals[0] == 'bar':
            pytest.xfail("See GH 19482")
I don't believe this would actually fail, but as noted in #19482 the current behavior is not ideal, so I didn't bother to emulate it. I figure this is better left to be fixed to raise in a separate change.
Codecov Report
@@ Coverage Diff @@
## master #19481 +/- ##
==========================================
+ Coverage 91.59% 91.62% +0.02%
==========================================
Files 150 150
Lines 48795 48803 +8
==========================================
+ Hits 44696 44715 +19
+ Misses 4099 4088 -11
Continue to review full report at Codecov.
pandas/_libs/algos.pxd
Outdated
@@ -11,3 +11,11 @@ cdef inline Py_ssize_t swap(numeric *a, numeric *b) nogil:
    a[0] = b[0]
    b[0] = t
    return 0

cdef:
you might be able to make these an enum
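A minimal sketch of what that could look like in algos.pxd (exact placement assumed; the type and member names are taken from the cimports and constants that appear elsewhere in this thread):

cdef enum TiebreakEnumType:
    TIEBREAK_AVERAGE
    TIEBREAK_MIN
    TIEBREAK_MAX
    TIEBREAK_FIRST
    TIEBREAK_FIRST_DESCENDING
    TIEBREAK_DENSE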
pandas/_libs/groupby.pyx
Outdated
                      ndarray[object, ndim=2] values,
                      ndarray[int64_t] labels,
                      bint is_datetimelike, object ties_method,
                      bint ascending, bint pct, object na_option):
ideally this could be incorporated in the templated one, though I would just raise on object dtype. seems odd to support this (as it's a very specific encoding, meaning lexical).
Simply raising on an object type would be easy. That said, wouldn't that then make this inconsistent with the nth / first / last functions? It seems strange to me if we end up allowing users to pick the nth object from a group of objects but do not allow them to rank those same objects.
Can ignore this comment. Was thinking at the time that nth had to do with the value of an item, but it instead has to do with the position. I'll push another PR raising for objects. Will have to communicate it as a breaking change in case anyone had previously been using that behavior.
@jreback just scoping this out, I think this is a fairly significant change. The main problem I see is that the current object ranking implementation is used by Categoricals as well. I imagine we'd then have to update it so that ordered Categoricals can use rank but non-ordered Categoricals cannot. Additionally, we'd have to update the code across Series, Frame and GroupBy objects to ensure those handle it consistently.
I think that strays a little too far from the original goal here of optimizing performance, and I would rather handle it in a separate change. Let me know if you disagree; otherwise happy to open that one.
absolutely. let's keep this to one item for this PR (you can open a separate issue for the other)
Opened #19560 for this issue. Could use some further discussion.
Otherwise, if this round of tests passes I plan to update whatsnew and wrap this one up. If there's anything outstanding please let me know.
I would like to raise on groupby_object_rank rather than implement it here. let's keep that for another discussion.
I would simply not define it at all. This will be caught at a higher level (IOW when it tries to compose the function name, it won't exist); I think that error is ok.
do we have benchmarks for groupby-rank in asv? if not, can you add some? post results either way
@@ -16,14 +16,20 @@ from numpy cimport (ndarray,
from libc.stdlib cimport malloc, free

from util cimport numeric, get_nat
from algos cimport swap
from algos import take_2d_axis1_float64_float64, groupsort_indexer
from algos cimport (swap, TiebreakEnumType, TIEBREAK_AVERAGE, TIEBREAK_MIN,
maybe import directly as the name of the enum?
can you also share this name with the non-grouping rank functions?
I tried this back in d09268b but all of the Py27 builds failed as a result with the error below (py3 was fine). Tried digging up info but couldn't find anything - any chance you've seen this before?
cythoning pandas/_libs/groupby.pyx to pandas/_libs/groupby.c
Error compiling Cython file:
------------------------------------------------------------
...
if keep_na and (values[_as[i], 0] != values[_as[i], 0]):
grp_na_count += 1
out[_as[i], 0] = np.nan
else:
if tiebreak == TiebreakEnumType.TIEBREAK_AVERAGE:
^
------------------------------------------------------------
pandas/_libs/groupby.pyx:178:43: Compiler crash in AnalyseExpressionsTransform
ModuleNode.body = StatListNode(groupby.pyx:4:0)
StatListNode.stats[16] = StatListNode(groupby.pyx:125:0)
StatListNode.stats[0] = CompilerDirectivesNode(groupby.pyx:125:0)
CompilerDirectivesNode.body = StatListNode(groupby.pyx:125:0)
StatListNode.stats[0] = DefNode(groupby.pyx:125:0,
doc = u'\n Only transforms on axis=0\n ',
modifiers = [...]/0,
name = u'group_rank_object',
num_required_args = 8,
py_wrapper_required = True,
reqd_kw_flags_cname = '0',
used = True)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:130:4)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:170:4)
File 'Nodes.py', line 6181, in analyse_expressions: ForInStatNode(groupby.pyx:170:4)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:171:8)
File 'Nodes.py', line 5842, in analyse_expressions: IfStatNode(groupby.pyx:174:8)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:178:12)
File 'Nodes.py', line 5840, in analyse_expressions: IfStatNode(groupby.pyx:178:12)
File 'Nodes.py', line 5885, in analyse_expressions: IfClauseNode(groupby.pyx:178:15)
File 'ExprNodes.py', line 541, in analyse_temp_boolean_expression: PrimaryCmpNode(groupby.pyx:178:24,
operator = u'==',
result_is_used = True,
use_managed_ref = True)
File 'ExprNodes.py', line 11893, in analyse_types: PrimaryCmpNode(groupby.pyx:178:24,
operator = u'==',
result_is_used = True,
use_managed_ref = True)
File 'ExprNodes.py', line 6329, in analyse_types: AttributeNode(groupby.pyx:178:43,
attribute = u'TIEBREAK_AVERAGE',
initialized_check = True,
is_attribute = 1,
needs_none_check = True,
result_is_used = True,
use_managed_ref = True)
File 'ExprNodes.py', line 6395, in analyse_as_type_attribute: AttributeNode(groupby.pyx:178:43,
attribute = u'TIEBREAK_AVERAGE',
initialized_check = True,
is_attribute = 1,
needs_none_check = True,
result_is_used = True,
use_managed_ref = True)
File 'ExprNodes.py', line 6438, in as_name_node: AttributeNode(groupby.pyx:178:43,
attribute = u'TIEBREAK_AVERAGE',
initialized_check = True,
is_attribute = 1,
needs_none_check = True,
result_is_used = True,
use_managed_ref = True)
File 'ExprNodes.py', line 1906, in analyse_rvalue_entry: NameNode(groupby.pyx:178:43,
cf_maybe_null = True,
is_name = True,
name = u'TIEBREAK_AVERAGE',
result_is_used = True,
use_managed_ref = True)
File 'ExprNodes.py', line 1939, in analyse_entry: NameNode(groupby.pyx:178:43,
cf_maybe_null = True,
is_name = True,
name = u'TIEBREAK_AVERAGE',
result_is_used = True,
use_managed_ref = True)
File 'ExprNodes.py', line 1953, in check_identifier_kind: NameNode(groupby.pyx:178:43,
cf_maybe_null = True,
is_name = True,
name = u'TIEBREAK_AVERAGE',
result_is_used = True,
use_managed_ref = True)
Compiler crash traceback from this point on:
File "/Users/williamayd/miniconda3/envs/pandas_dev2/lib/python2.7/site-packages/Cython/Compiler/ExprNodes.py", line 1953, in check_identifier_kind
if entry.is_type and entry.type.is_extension_type:
AttributeError: 'NoneType' object has no attribute 'is_type'
that's annoying. haven't seen that
pandas/core/groupby.py
Outdated
@@ -2159,6 +2172,15 @@ def get_group_levels(self):
    # ------------------------------------------------------------
    # Aggregation functions

    def _group_rank_wrapper(func, *args, **kwargs):
I would encode this in _cython_functions directly
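A hypothetical sketch of what "encoding it directly" could mean — the registry entry for 'rank' builds the bound callable itself instead of routing through a standalone wrapper (all names here are assumptions, not the merged implementation):

from functools import partial

def _make_rank_func(group_rank, kwargs):
    # registry entry for 'rank': returns a callable with the uniform
    # (values, labels, is_datetimelike) shape the caller expects,
    # defaults mirroring GroupBy.rank's signature
    return partial(group_rank,
                   ties_method=kwargs.get('ties_method', 'average'),
                   ascending=kwargs.get('ascending', True),
                   pct=kwargs.get('pct', False),
                   na_option=kwargs.get('na_option', 'keep'))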
@@ -2314,10 +2341,13 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1):
            else:
                raise

        if is_numeric:
            out_dtype = '%s%d' % (values.dtype.kind, values.dtype.itemsize)
            if how == 'rank':
we should prob have a better way of doing this :<
The catch I see here is that unlike other groupby transformations (where the dtype of the result object typically matches the dtype of the groups' values and gets cast as necessary after transformation), the dtype of the result object for rank needs to be a float prior to calling any group_rank operation. Otherwise, things like TIEBREAK_AVERAGE will not work when ranking, say, a group full of ints.
The generic calls to algos have it a little easier because they don't pass an object in to be modified, using instead the return value, which algos builds internally as a like-sized array of floats. Unless there's something basic I don't understand with Cython, I don't think there's a way to upcast the dtype of the return object that gets passed into group_rank from, say, an int to a float. So the only option I can think of would be to break group_rank out from the other transformations and have it return a new result object instead of modifying the provided one in place.
I'd argue the effort and maintainability cost of doing that would far outweigh any benefit from cleaning up this conditional, but I'm curious if you have other thoughts.
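The int-vs-float point is easy to demonstrate (sample data assumed): average tiebreaks produce fractional ranks, so an int64 result buffer could not hold them.

import pandas as pd

s = pd.Series([1, 1, 2], dtype='int64')
print(s.rank(method='average').tolist())  # [1.5, 1.5, 3.0] -- float output for int input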
this is similar in a way to the pre-transformations we do for datetimelikes. What I meant is that the registry of _cython_functions should have this logic itself (which is already based on the function name).
e.g. maybe from _cython_functions you get back a tuple rather than a scalar / lambda function, which includes the required dtype (or None). Or _cython_functions actually should be a class which is dispatched to, to call the functions themselves. IOW something like
class CythonTransformFunction:

    def __init__(self, obj):
        self.obj = obj

    def cummax(self):
        return group_cummax(self.obj.values)

    def rank(self):
        # rank needs a float buffer up front
        return group_rank(self.obj.values.astype('float64'))
(and really we should have more logic pushed into this class, e.g. pre- and post-convert dtype handling. And we should have a Transformation and an Aggregation version. This would allow all of the specific logic to be pushed here, rather than looked up in dicts and such. Further, we could probably move this code out of the main groupby.py class into a sub-module.)
don't have to do this here, but groupby for sure needs cleanup like this.
I posted ASVs for a previous commit. Trying to repost for the latest but they aren't working any more. I think 3597de0 broke the ASVs by removing the top-level rolling_* functions that the benchmarks import:

File "/Users/williamayd/Git/pandas/asv_bench/benchmarks/gil.py", line 3, in <module>
    from pandas import (DataFrame, Series, rolling_median, rolling_mean,
ImportError: cannot import name 'rolling_median'

Need to debug a little more but will probably have to open a separate issue to fix and get those working again.

EDIT: looks like #19236 is supposed to fix the ASV issue. Can repost results after that gets merged.
pandas/_libs/groupby_helper.pxi.in
Outdated
                bint is_datetimelike, object ties_method,
                bint ascending, bint pct, object na_option):
    """
    Only transforms on axis=0
can you add a more complete doc-string?
        for j in range(i - dups + 1, i + 1):
            if ascending:
                out[_as[j], 0] = j + 1 - grp_start
            else:
any comments on the impl would be helpful (to future readers)
nice comments!
pandas/core/groupby.py
Outdated
    @Appender(_doc_template)
    def rank(self, method='average', ascending=True, na_option='keep',
             pct=False, axis=0):
        """Rank within each group"""
does this doc-string look ok?
Latest ASVs posted below:

       before           after         ratio
     [b8351277]       [0141747d]
+ 145±2μs 479±2μs 3.29 groupby.GroupByMethods.time_method('int', 'cummax')
+ 144±0.8μs 469±6μs 3.26 groupby.GroupByMethods.time_method('int', 'cummin')
+ 148±2μs 479±8μs 3.24 groupby.GroupByMethods.time_method('float', 'cummax')
+ 160±0.9μs 493±10μs 3.08 groupby.GroupByMethods.time_method('float', 'cumsum')
+ 155±6μs 478±20μs 3.08 groupby.GroupByMethods.time_method('float', 'cummin')
+ 169±2μs 484±1μs 2.87 groupby.GroupByMethods.time_method('int', 'cumsum')
+ 557±3μs 1.14±0.2ms 2.05 groupby.GroupByMethods.time_method('int', 'sem')
+ 308±5μs 626±2μs 2.03 groupby.GroupByMethods.time_method('int', 'cumprod')
+ 353±8μs 636±5μs 1.80 groupby.GroupByMethods.time_method('float', 'cumprod')
+ 149±0.7μs 192±1μs 1.29 groupby.GroupByMethods.time_method('float', 'min')
+ 300±2μs 380±10μs 1.27 groupby.GroupByMethods.time_method('float', 'head')
+ 312±7μs 385±2μs 1.24 groupby.GroupByMethods.time_method('int', 'std')
+ 149±1μs 180±8μs 1.21 groupby.GroupByMethods.time_method('float', 'last')
+ 335±2μs 403±7μs 1.20 groupby.GroupByMethods.time_method('float', 'median')
+ 353±0.9μs 420±10μs 1.19 groupby.GroupByMethods.time_method('int', 'nunique')
+ 484±5μs 570±20μs 1.18 groupby.GroupByMethods.time_method('float', 'sem')
+ 914ms 1.04s 1.14 groupby.GroupByMethods.time_method('float', 'mad')
+ 889ms 1.00s 1.13 groupby.GroupByMethods.time_method('float', 'pct_change')
+ 2.93s 3.28s 1.12 groupby.GroupByMethods.time_method('float', 'describe')
+ 214±4ms 237±10ms 1.11 groupby.GroupByMethods.time_method('int', 'skew')
- 1.98s 1.77s 0.89 groupby.GroupByMethods.time_method('int', 'describe')
- 178±5μs 157±0.7μs 0.89 groupby.GroupByMethods.time_method('int', 'first')
- 187±0.7ms 772±20μs 0.00 groupby.GroupByMethods.time_method('int', 'rank')
- 293±0.9ms 793±40μs 0.00 groupby.GroupByMethods.time_method('float', 'rank')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
this might also close #11759 (as you added the full signature). if so can you add a test?
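For reference, the full signature being exposed here, exercised end-to-end on assumed sample data:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [2, 1, 3]})
# group 'a': descending min ranks [1, 2] -> pct [0.5, 1.0]; group 'b': [1.0]
print(df.groupby('key').rank(method='min', ascending=False,
                             na_option='keep', pct=True))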
pandas/_libs/groupby.pyx
Outdated

cdef int64_t iNaT = get_nat()

cdef double NaN = <double> np.NaN
cdef double nan = NaN

cdef extern from "numpy/npy_math.h" nogil:
why do you need these? you can just use np.isnan, no?
I was trying to do the NaN check inside the nogil block, hence the need for it, but your comment makes me realize that I am already creating a mask array that I can reference instead of doing that check again within the nogil block.
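A sketch of the mask-based approach (illustrative, not the actual implementation): compute the NaN mask once with NumPy while the GIL is held, so the nogil loop only needs cheap array reads.

import numpy as np

vals = np.array([3.0, np.nan, 1.0])
# built up front while the GIL is held; a uint8 view of the boolean
# array can then be indexed inside a nogil loop with no isnan calls
mask = np.isnan(vals).view(np.uint8)
print(mask)  # [0 1 0]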
pandas/_libs/groupby_helper.pxi.in
Outdated
            # also be sure to reset any of the items helping to calculate dups
            if i == N - 1 or labels[_as[i]] != labels[_as[i+1]]:
                if pct:
                    for j in range(grp_start, i + 1):
I think the pct here is confusing. just fill another variable with the group sizes as you go then do the division at the very end.
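A sketch of the suggested restructure (array contents assumed): track each row's group size during the main pass, then do a single vectorized division at the end rather than a per-group back-fill loop.

import numpy as np

ranks = np.array([1.0, 2.0, 1.0, 2.0, 3.0])
grp_sizes = np.array([2, 2, 3, 3, 3])  # filled in as each group is walked
pct_ranks = ranks / grp_sizes
print(pct_ranks)  # [0.5 1. 0.333... 0.666... 1.]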
Missed your note on #11759 before the latest commit. With the example provided there, this fix would now be raising.
thanks! ok, ideally we'd have a better message for unsupported groupbys, but that can be addressed with a followup.
https://travis-ci.org/MacPython/pandas-wheels/jobs/340767780 is a 32-bit build which is failing; sample error:
not sure exactly what is happening - can you have a look? note that it is almost impossible to actually test this without merging; these are built once a day. once we have a fix which passes regular travis we can go from there. thanks.
EDIT: Ignore the below comment - didn't look closely enough at the traceback you provided.

Judging by the line number I think this may be a side effect of #19635. Previously the following condition preceded that:

{{if name == 'int64'}}
if val != {{nan_val}}:
{{else}}
if val == val and val != {{nan_val}}:
{{endif}}

Which I simplified to:

if val == val and val != {{nan_val}}:

I can't imagine why that would make a difference, but it is suspicious given the location of the error. Want me to add that condition back in and submit it referencing this PR?
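For context on the guard being discussed: val == val is the standard C-level NaN check (NaN is the only value not equal to itself), so the non-int64 branch skips NaNs in addition to the sentinel value.

import numpy as np

val = np.nan
print(val == val)  # False -- only NaN fails self-equality
print(1.0 == 1.0)  # True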
yep thanks
@WillAyd not sure if it is related to this PR, but on AppVeyor there are a huge number of numpy warnings ("DeprecationWarning: numpy boolean subtract, the `-` operator, is deprecated"). See https://ci.appveyor.com/project/pandas-dev/pandas/build/1.0.13483/job/vt1rpmyub41gf6aa
From a quick manual appveyor bisect: introduced by #20091
cc @peterpanmj
I think this warning is coming from np.diff. It looks like in 1.13 this was actually performing subtraction regardless of type, so perhaps something that just slipped through on that release?
Yes, this is a result of np.diff in 1.13. I used np.diff in #20091. Actually something like the ^ operator can be used instead.
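What "use the ^ operator instead" looks like in practice (assumed boolean mask): XOR of shifted slices flags the same transitions as np.diff without the deprecated boolean subtraction.

import numpy as np

mask = np.array([True, True, False, False, True])
changed = mask[1:] ^ mask[:-1]  # True where consecutive entries differ
print(changed)  # [False  True False  True]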
git diff upstream/master -u -- "*.py" | flake8 --diff
This is not complete, but I wanted to submit it for review on the direction. In particular, I wanted to know if my way of passing the named rank arguments back to the Cython layer makes sense, or if we'd rather bypass using kwargs and call that function directly from the GroupBy instance method (similar to how shift does it).

Right now this only increments values ascending, doesn't handle tiebreakers, nor does it allow for the return of a percentage. I also plan on adding some test cases to cover the arguments, as I can't find them in the pandas.tests.groupby package. More to come, but again I wanted feedback before going too far in.