
PERF: Cythonize Groupby Rank #19481

Merged: 35 commits into pandas-dev:master, Feb 10, 2018

Conversation

@WillAyd (Member) commented Jan 31, 2018

This is not complete, but I wanted to submit it for review on the direction. In particular, I wanted to know whether my way of passing the named rank arguments back to the Cython layer makes sense, or whether we'd rather bypass kwargs and call that function directly from the GroupBy instance method (similar to how shift does it).

Right now this only ranks values in ascending order; it doesn't handle tiebreakers, nor does it allow returning a percentage. I also plan on adding some test cases to cover the arguments, as I can't find any in the pandas.tests.groupby package. More to come, but again I wanted feedback before going too far in.
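A rough sketch of the two options being weighed (signatures hypothetical, not the final API):

    # Option 1: forward the named arguments through the generic transform
    # machinery as kwargs
    result = self._cython_transform('rank', ties_method=method,
                                    ascending=ascending,
                                    na_option=na_option, pct=pct)

    # Option 2: call the Cython routine directly from the GroupBy instance
    # method, as shift does, bypassing kwargs entirely
    # libgroupby.group_rank(out, values, labels, ties_method, ascending, ...)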

@pep8speaks commented Jan 31, 2018

Hello @WillAyd! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 9, 2018 at 18:37 UTC

ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
bint is_datetimelike, **kwargs):
"""
Contributor:

looks ok. Don't pass kwargs around generally (especially in Cython); we want to specify the args directly. This may make the call slightly trickier in cython_transform (e.g. you may need to use a partial or lambda to hold the additional args).
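A minimal sketch of the partial approach (the bound function name is illustrative, not necessarily what the PR generates):

    from functools import partial

    # bind the rank-specific arguments ahead of time so the generic transform
    # machinery can keep calling func(out, values, labels, is_datetimelike)
    func = partial(group_rank_float64, ties_method='average',
                   ascending=True, pct=False, na_option='keep')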

@jreback added the Groupby and Performance (memory or execution speed) labels on Feb 1, 2018
@WillAyd (Member Author) commented Feb 5, 2018

Still not complete, but sharing in the interim for any code review, as this change is large. At a minimum I need to go back and use something other than kwargs to pass the appropriate values back.

out[_as[j], 0] = i - grp_start + 1
elif tiebreak == TIEBREAK_FIRST:
for j in range(i - dups + 1, i + 1):
if ascending:
WillAyd (Member Author):

This could arguably also be done using the TIEBREAK_FIRST_DESCENDING flag, but I figured it was worth just using TIEBREAK_FIRST and adding a conditional, given that the former is what is being passed.

grp_na_count += 1
out[_as[i], 0] = np.nan
else:
if tiebreak == TIEBREAK_AVERAGE:
@WillAyd (Member Author), Feb 5, 2018:

In general this looping mechanism isn't very efficient because it continually overwrites "duplicate" values in a loop. Given that the benchmarks were still significantly faster, I left it as is and was planning to open a separate change to optimize further, but I could revisit as part of this PR if you feel this loop mechanism is not acceptable.

dups += 1
sum_ranks += i - grp_start + 1

if keep_na and masked_vals[_as[i]] == nan_fill_val:
WillAyd (Member Author):

This comparison to nan_fill_val gets a little tricky because it conflates np.nan with np.inf (or whatever fill value is being used). That said, that is a pre-existing limitation, which I opened #19538 to address.
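A small illustration of that limitation in plain numpy (not the actual Cython path):

    import numpy as np

    # after filling NaNs with inf so they sort to the end, a genuine inf in
    # the data is indistinguishable from a filled NaN
    vals = np.array([1.0, np.inf, np.nan])
    masked = np.where(np.isnan(vals), np.inf, vals)
    # masked is [1., inf, inf]: the real inf and the filled NaN compare equal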

ascending = kwargs['ascending']
pct = kwargs['pct']
keep_na = kwargs['na_option'] == 'keep'
N, K = (<object> values).shape
WillAyd (Member Author):

Due to my limited understanding, I am not using the K value that gets extracted here, as I couldn't figure out under what circumstance K was ever not equal to 0. Can you advise how that works, or point me at something to help my comprehension?

@WillAyd (Member Author) commented Feb 5, 2018

Benchmarks provided below for reference

Before     after       ratio
  [f391cbfe] [27cb6b57]
+  139.69μs   535.43μs      3.83  groupby.GroupByMethods.time_method('float', 'cummin')
+  142.06μs   522.73μs      3.68  groupby.GroupByMethods.time_method('float', 'cummax')
+  145.02μs   527.11μs      3.63  groupby.GroupByMethods.time_method('int', 'cummin')
+  162.18μs   574.63μs      3.54  groupby.GroupByMethods.time_method('float', 'cumsum')
+  145.86μs   512.28μs      3.51  groupby.GroupByMethods.time_method('int', 'cummax')
+  159.64μs   533.92μs      3.34  groupby.GroupByMethods.time_method('int', 'cumsum')
+  311.50μs   728.59μs      2.34  groupby.GroupByMethods.time_method('float', 'cumprod')
+  315.28μs   650.14μs      2.06  groupby.GroupByMethods.time_method('int', 'cumprod')
+  872.00μs     1.24ms      1.42  groupby.GroupByMethods.time_method('int', 'sem')
+  901.00μs     1.21ms      1.34  groupby.GroupByMethods.time_method('float', 'sem')
+  161.36μs   203.44μs      1.26  groupby.GroupByMethods.time_method('int', 'min')
+  200.87ms   235.93ms      1.17  groupby.GroupByMethods.time_method('float', 'all')
+   64.93μs    75.73μs      1.17  groupby.GroupByMethods.time_method('int', 'size')
+  299.42μs   345.95μs      1.16  groupby.GroupByMethods.time_method('int', 'std')
+  201.29ms   231.93ms      1.15  groupby.GroupByMethods.time_method('float', 'any')
+  579.50ms   659.22ms      1.14  groupby.GroupByMethods.time_method('int', 'pct_change')
+     2.84s      3.19s      1.12  groupby.GroupByMethods.time_method('float', 'describe')
+  192.43μs   215.57μs      1.12  groupby.GroupByMethods.time_method('int', 'cumcount')
+  319.90μs   356.71μs      1.12  groupby.GroupByMethods.time_method('int', 'median')
+  156.03μs   173.91μs      1.11  groupby.GroupByMethods.time_method('float', 'min')
+  121.26μs   134.29μs      1.11  groupby.GroupByMethods.time_method('int', 'shift')
+  173.37μs   191.73μs      1.11  groupby.GroupByMethods.time_method('float', 'var')
-  193.48ms   795.71μs      0.00  groupby.GroupByMethods.time_method('int', 'rank')
-  301.72ms   787.03μs      0.00  groupby.GroupByMethods.time_method('float', 'rank')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

])
def test_rank_args(self, grps, vals, ties_method, ascending, pct, exp):
if ties_method == 'first' and vals[0] == 'bar':
pytest.xfail("See GH 19482")
@WillAyd (Member Author), Feb 5, 2018:

I don't believe this would actually fail, but as noted in #19482 the current behavior is not ideal, so I didn't bother to emulate it. I figure this is better left to be fixed to raise in a separate change.

@codecov bot commented Feb 6, 2018

Codecov Report

Merging #19481 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19481      +/-   ##
==========================================
+ Coverage   91.59%   91.62%   +0.02%     
==========================================
  Files         150      150              
  Lines       48795    48803       +8     
==========================================
+ Hits        44696    44715      +19     
+ Misses       4099     4088      -11
Flag Coverage Δ
#multiple 89.99% <100%> (+0.02%) ⬆️
#single 41.72% <33.33%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.2% <100%> (+0.07%) ⬆️
pandas/util/testing.py 83.64% <0%> (-0.21%) ⬇️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -11,3 +11,11 @@ cdef inline Py_ssize_t swap(numeric *a, numeric *b) nogil:
a[0] = b[0]
b[0] = t
return 0

cdef:
Contributor:

you might be able to make these an enum

ndarray[object, ndim=2] values,
ndarray[int64_t] labels,
bint is_datetimelike, object ties_method,
bint ascending, bint pct, object na_option):
Contributor:

Ideally this could be incorporated in the templated one, though I would just raise on object dtype. It seems odd to support this (as it's a very specific encoding, meaning lexical).

WillAyd (Member Author):

Simply raising on an object type would be easy. That said, wouldn't that then make this inconsistent with the nth / first / last functions?

It seems strange to me if we end up allowing users to pick the nth object from a group of objects but do not allow them to rank those same objects

@WillAyd (Member Author), Feb 6, 2018:

You can ignore this comment. I was thinking at the time that nth had to do with the value of an item, but it instead has to do with the position. I'll push another PR raising for objects. We'll have to communicate it as a breaking change in case anyone had previously been relying on that behavior.

WillAyd (Member Author):

@jreback, just scoping this out: I think this is a fairly significant change. The main problem I see is that the current object ranking implementation is used by Categoricals as well. I imagine we'd then have to update it so that ordered Categoricals can use rank but unordered Categoricals cannot. Additionally, we would have to update the code across Series, Frame and GroupBy objects to ensure they handle this consistently.

I think that strays a little too far from the original goal here of optimizing performance, and I would rather handle it in a separate change. Let me know if you disagree; otherwise I'm happy to open that one.

Contributor:

absolutely. Let's keep this to one item for this PR (you can open a separate issue for the other).

WillAyd (Member Author):

Opened #19560 for this issue; it could use some further discussion.

Otherwise, if this round of tests passes I plan to update whatsnew and wrap this one up. If there's anything outstanding please let me know.

Contributor:

I would like to raise on groupby_object_rank rather than implement it here. let's keep that for another discussion.

Contributor:

I would simply not define it at all. This will be caught at a higher level (IOW it will try to compose the function name, which won't exist); I think the error is ok.

@jreback (Contributor) left a comment:

do we have benchmarks for groupby rank in asv? if not, can you add them? post results either way.
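A minimal asv benchmark sketch for groupby rank (class and parameter names hypothetical; the suite's existing GroupByMethods class could instead just gain 'rank' as a method parameter, which is what the posted results below suggest happened):

    import numpy as np
    import pandas as pd

    class GroupByRank:
        def setup(self):
            n = 10000
            self.df = pd.DataFrame({'key': np.random.randint(0, 100, n),
                                    'value': np.random.randn(n)})

        def time_rank(self):
            self.df.groupby('key')['value'].rank()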


@@ -16,14 +16,20 @@ from numpy cimport (ndarray,
from libc.stdlib cimport malloc, free

from util cimport numeric, get_nat
from algos cimport swap
from algos import take_2d_axis1_float64_float64, groupsort_indexer
from algos cimport (swap, TiebreakEnumType, TIEBREAK_AVERAGE, TIEBREAK_MIN,
Contributor:

maybe import directly as the name of the enum?

can you also share this name with the non-grouping rank functions?

WillAyd (Member Author):

I tried this back in d09268b, but all of the Py27 builds failed as a result with the error below (py3 was fine). I tried digging up info but couldn't find anything. Any chance you've seen this before?

cythoning pandas/_libs/groupby.pyx to pandas/_libs/groupby.c

Error compiling Cython file:
------------------------------------------------------------
...

        if keep_na and (values[_as[i], 0] != values[_as[i], 0]):
            grp_na_count += 1
            out[_as[i], 0] = np.nan
        else:
            if tiebreak == TiebreakEnumType.TIEBREAK_AVERAGE:
                                          ^
------------------------------------------------------------

pandas/_libs/groupby.pyx:178:43: Compiler crash in AnalyseExpressionsTransform

ModuleNode.body = StatListNode(groupby.pyx:4:0)
StatListNode.stats[16] = StatListNode(groupby.pyx:125:0)
StatListNode.stats[0] = CompilerDirectivesNode(groupby.pyx:125:0)
CompilerDirectivesNode.body = StatListNode(groupby.pyx:125:0)
StatListNode.stats[0] = DefNode(groupby.pyx:125:0,
    doc = u'\n    Only transforms on axis=0\n    ',
    modifiers = [...]/0,
    name = u'group_rank_object',
    num_required_args = 8,
    py_wrapper_required = True,
    reqd_kw_flags_cname = '0',
    used = True)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:130:4)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:170:4)
File 'Nodes.py', line 6181, in analyse_expressions: ForInStatNode(groupby.pyx:170:4)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:171:8)
File 'Nodes.py', line 5842, in analyse_expressions: IfStatNode(groupby.pyx:174:8)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:178:12)
File 'Nodes.py', line 5840, in analyse_expressions: IfStatNode(groupby.pyx:178:12)
File 'Nodes.py', line 5885, in analyse_expressions: IfClauseNode(groupby.pyx:178:15)
File 'ExprNodes.py', line 541, in analyse_temp_boolean_expression: PrimaryCmpNode(groupby.pyx:178:24,
    operator = u'==',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 11893, in analyse_types: PrimaryCmpNode(groupby.pyx:178:24,
    operator = u'==',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 6329, in analyse_types: AttributeNode(groupby.pyx:178:43,
    attribute = u'TIEBREAK_AVERAGE',
    initialized_check = True,
    is_attribute = 1,
    needs_none_check = True,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 6395, in analyse_as_type_attribute: AttributeNode(groupby.pyx:178:43,
    attribute = u'TIEBREAK_AVERAGE',
    initialized_check = True,
    is_attribute = 1,
    needs_none_check = True,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 6438, in as_name_node: AttributeNode(groupby.pyx:178:43,
    attribute = u'TIEBREAK_AVERAGE',
    initialized_check = True,
    is_attribute = 1,
    needs_none_check = True,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 1906, in analyse_rvalue_entry: NameNode(groupby.pyx:178:43,
    cf_maybe_null = True,
    is_name = True,
    name = u'TIEBREAK_AVERAGE',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 1939, in analyse_entry: NameNode(groupby.pyx:178:43,
    cf_maybe_null = True,
    is_name = True,
    name = u'TIEBREAK_AVERAGE',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 1953, in check_identifier_kind: NameNode(groupby.pyx:178:43,
    cf_maybe_null = True,
    is_name = True,
    name = u'TIEBREAK_AVERAGE',
    result_is_used = True,
    use_managed_ref = True)

Compiler crash traceback from this point on:
  File "/Users/williamayd/miniconda3/envs/pandas_dev2/lib/python2.7/site-packages/Cython/Compiler/ExprNodes.py", line 1953, in check_identifier_kind
    if entry.is_type and entry.type.is_extension_type:
AttributeError: 'NoneType' object has no attribute 'is_type'

Contributor:

that's annoying. haven't seen that

@@ -2159,6 +2172,15 @@ def get_group_levels(self):
# ------------------------------------------------------------
# Aggregation functions

def _group_rank_wrapper(func, *args, **kwargs):
Contributor:

I would encode this in _cython_functions directly

@@ -2314,10 +2341,13 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1):
else:
raise

if is_numeric:
out_dtype = '%s%d' % (values.dtype.kind, values.dtype.itemsize)
if how == 'rank':
Contributor:

we should prob have a better way of doing this :<

WillAyd (Member Author):

The catch I see here is that, unlike other groupby transformations (where the dtype of the result object typically matches the dtype of the groups' values and gets cast as necessary after transformation), the dtype of the result object for rank needs to be a float prior to calling any group_rank operation. Otherwise, things like TIEBREAK_AVERAGE will not work when ranking, say, a group full of ints.

The generic calls to algos have it a little easier because they don't pass an object in to be modified, instead using the return value, which algos builds internally as a like-sized array of floats. Unless there's something basic I don't understand about Cython, I don't think there's a way to upcast the dtype of the return object that gets passed into group_rank from, say, an int to a float. The only option I can think of would be to break group_rank out from the other transformations and have it return a new result object instead of modifying the provided one in place.

I'd argue that the effort and maintainability cost of doing that would far outweigh any benefit from cleaning up this conditional, but I'm curious if you have other thoughts.
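A small sketch of that constraint, with hypothetical names:

    import numpy as np

    # the result buffer must already be float when the Cython kernel runs,
    # since averaged ties of integer values are fractional
    values = np.array([1, 1, 2])                        # int group values
    out = np.empty((len(values), 1), dtype=np.float64)  # float buffer up front
    # group_rank(out, values.astype(np.float64), ...) could then write the
    # averaged ranks 1.5, 1.5, 3.0, which an int buffer could not hold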

Contributor:

this is similar in a way to the pre-transformations we do for datetimelikes. What I meant is that the registry of _cython_functions should have this logic itself (which is already based on the function name).

e.g. maybe _cython_functions gives you back a tuple rather than a scalar / lambda function, which includes the required dtype (or None). Or _cython_functions should actually be a class which is dispatched to, to call the functions themselves. IOW something like

class CythonTransformFunction:
    def __init__(self, obj):
        self.obj = obj

    def cummax(self):
        return group_cummax(self.obj.values)

    def rank(self):
        return group_rank(self.obj.values.astype('float64'))

(and really more logic should be pushed to this class, e.g. pre- and post-convert dtype things, and there should be a Transformation and an Aggregation version. This would allow all of the specific logic to be pushed here, rather than lookups in dicts and such. Further, we could probably move this code out of the main groupby.py class into a sub-module.)

don't have to do this here, but groupby for sure needs cleanup like this.

@WillAyd (Member Author) commented Feb 7, 2018

I posted ASVs for a previous commit. Trying to repost for the latest, but they aren't working any more. I think 3597de0 broke the ASVs by removing rolling_median and potentially other functions, which now yields the following when trying to run benchmarks:

File "/Users/williamayd/Git/pandas/asv_bench/benchmarks/gil.py", line 3, in <module>
  from pandas import (DataFrame, Series, rolling_median, rolling_mean,
ImportError: cannot import name 'rolling_median'

I need to debug a little more, but will probably have to open a separate issue to fix and get those working again.

EDIT: looks like #19236 is supposed to fix the ASV issue. I can repost results after that gets merged.



bint is_datetimelike, object ties_method,
bint ascending, bint pct, object na_option):
"""
Only transforms on axis=0
Contributor:

can you add a fuller doc-string?

for j in range(i - dups + 1, i + 1):
if ascending:
out[_as[j], 0] = j + 1 - grp_start
else:
Contributor:

any comments on the impl would be helpful (to future readers)

Contributor:

nice comments!

@Appender(_doc_template)
def rank(self, method='average', ascending=True, na_option='keep',
pct=False, axis=0):
"""Rank within each group"""
Contributor:

does this doc-string look ok?


@WillAyd (Member Author) commented Feb 8, 2018

Latest ASVs posted below

       before           after         ratio
     [b8351277]       [0141747d]
+         145±2μs          479±2μs     3.29  groupby.GroupByMethods.time_method('int', 'cummax')
+       144±0.8μs          469±6μs     3.26  groupby.GroupByMethods.time_method('int', 'cummin')
+         148±2μs          479±8μs     3.24  groupby.GroupByMethods.time_method('float', 'cummax')
+       160±0.9μs         493±10μs     3.08  groupby.GroupByMethods.time_method('float', 'cumsum')
+         155±6μs         478±20μs     3.08  groupby.GroupByMethods.time_method('float', 'cummin')
+         169±2μs          484±1μs     2.87  groupby.GroupByMethods.time_method('int', 'cumsum')
+         557±3μs       1.14±0.2ms     2.05  groupby.GroupByMethods.time_method('int', 'sem')
+         308±5μs          626±2μs     2.03  groupby.GroupByMethods.time_method('int', 'cumprod')
+         353±8μs          636±5μs     1.80  groupby.GroupByMethods.time_method('float', 'cumprod')
+       149±0.7μs          192±1μs     1.29  groupby.GroupByMethods.time_method('float', 'min')
+         300±2μs         380±10μs     1.27  groupby.GroupByMethods.time_method('float', 'head')
+         312±7μs          385±2μs     1.24  groupby.GroupByMethods.time_method('int', 'std')
+         149±1μs          180±8μs     1.21  groupby.GroupByMethods.time_method('float', 'last')
+         335±2μs          403±7μs     1.20  groupby.GroupByMethods.time_method('float', 'median')
+       353±0.9μs         420±10μs     1.19  groupby.GroupByMethods.time_method('int', 'nunique')
+         484±5μs         570±20μs     1.18  groupby.GroupByMethods.time_method('float', 'sem')
+           914ms            1.04s     1.14  groupby.GroupByMethods.time_method('float', 'mad')
+           889ms            1.00s     1.13  groupby.GroupByMethods.time_method('float', 'pct_change')
+           2.93s            3.28s     1.12  groupby.GroupByMethods.time_method('float', 'describe')
+         214±4ms         237±10ms     1.11  groupby.GroupByMethods.time_method('int', 'skew')
-           1.98s            1.77s     0.89  groupby.GroupByMethods.time_method('int', 'describe')
-         178±5μs        157±0.7μs     0.89  groupby.GroupByMethods.time_method('int', 'first')
-       187±0.7ms         772±20μs     0.00  groupby.GroupByMethods.time_method('int', 'rank')
-       293±0.9ms         793±40μs     0.00  groupby.GroupByMethods.time_method('float', 'rank')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback (Contributor) left a comment:

this might also close #11759 (as you added the full signature). if so can you add a test?


cdef int64_t iNaT = get_nat()

cdef double NaN = <double> np.NaN
cdef double nan = NaN

cdef extern from "numpy/npy_math.h" nogil:
Contributor:

why do you need these? you can just use np.isnan, no?

WillAyd (Member Author):

I was trying to do the nan check inside the nogil block, hence the need for it, but your comment makes me realize that I'm already creating a mask array I can reference instead of doing that check again within the nogil.
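A minimal sketch of that reuse, with hypothetical variable names:

    import numpy as np

    # build the NaN mask once while the GIL is held; the nogil loop can then
    # read the mask instead of re-testing values against nan_fill_val
    masked_vals = values.copy()
    mask = np.isnan(masked_vals)
    masked_vals[mask] = np.inf  # nan_fill_val for the ascending case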


# also be sure to reset any of the items helping to calculate dups
if i == N - 1 or labels[_as[i]] != labels[_as[i+1]]:
if pct:
for j in range(grp_start, i + 1):
Contributor:

I think the pct here is confusing. Just fill another variable with the group sizes as you go, then do the division at the very end.
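A small sketch of the suggestion in plain numpy (names hypothetical):

    import numpy as np

    # record the size of each row's group while ranking, then convert ranks
    # to percentages with a single division at the very end
    ranks = np.array([1., 2., 3., 1., 2.])      # per-group ranks
    grp_sizes = np.array([3., 3., 3., 2., 2.])  # size of each row's group
    pct_ranks = ranks / grp_sizes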

@WillAyd (Member Author) commented Feb 9, 2018

Missed your note on #11759 before the latest commit. With the example provided there, this fix would now raise TypeError: 'NoneType' object is not callable, because there is no group_rank_object implementation. Is that what you are looking for, and if so, do you not feel that is already covered by the test_rank_object_raises method that was added?

@jreback jreback merged commit c1068d9 into pandas-dev:master Feb 10, 2018
@jreback (Contributor) commented Feb 10, 2018

thanks! Ideally we'd have a better message for unsupported groupbys, but that can be addressed with a followup.

@jreback (Contributor) commented Feb 13, 2018

https://travis-ci.org/MacPython/pandas-wheels/jobs/340767780 is a 32-bit build which is failing

sample error

/venv/lib/python3.6/site-packages/pandas/tests/groupby/test_groupby.py:1932: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:1804: in rank
    na_option=na_option, pct=pct, axis=axis)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:1006: in _cython_transform
    **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2427: in transform
    return self._cython_operation('transform', values, how, axis, **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2387: in _cython_operation
    **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2461: in _transform
    transform_func(result, values, comp_ids, is_datetimelike, **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2276: in wrapper
    return f(afunc, *args, **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2228: in <lambda>
    kwargs.get('na_option', 'keep')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'int'
pandas/_libs/groupby_helper.pxi:1250: ValueError

Not sure exactly what is happening; can you have a look? Note that it is almost impossible to actually test this without merging, since these wheels are built once a day. Once we have a fix that passes regular Travis we can go from there. Thanks.
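A small illustration of the likely mismatch (platform-dependent behavior, so hedged):

    import numpy as np

    # on a 32-bit build the default integer type is 32 bits, so an array
    # created without an explicit dtype will not match an int64_t buffer
    labels = np.array([0, 0, 1, 1])               # int32 on 32-bit platforms
    labels = labels.astype(np.int64, copy=False)  # satisfies ndarray[int64_t]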

@WillAyd (Member Author) commented Feb 13, 2018

EDIT: Ignore the comment below. I didn't look closely enough at the traceback you provided, where rank is the offending function.

Judging by the line number, I think this may be a side effect of #19635. Previously there was the following condition preceding that:

  {{if name == 'int64'}}		
  if val != {{nan_val}}:		
  {{else}}		
  if val == val and val != {{nan_val}}:
  {{endif}}

Which I simplified to

  if val == val and val != {{nan_val}}:

I can't imagine why that would make a difference, but it is suspicious given the location of the error. Want me to add that condition back in and submit it referencing this PR?

@jreback (Contributor) commented Feb 13, 2018

yep thanks

@jorisvandenbossche (Member):

@WillAyd not sure if it is related to this PR, but on AppVeyor there are a huge number of numpy warnings ("DeprecationWarning: numpy boolean subtract, the - operator, is deprecated, use the bitwise_xor, the ^ operator, or the logical_xor function instead.") related to rank tests (so many warnings that you can't even see the actual errors anymore).

See https://ci.appveyor.com/project/pandas-dev/pandas/build/1.0.13483/job/vt1rpmyub41gf6aa

@jorisvandenbossche (Member):

From a quick manual appveyor bisect: introduced by #20091

@WillAyd (Member Author) commented Apr 19, 2018

cc @peterpanmj

@WillAyd (Member Author) commented Apr 19, 2018

I think this warning is coming from np.diff, specifically at the line below (link goes to master):

https://github.com/numpy/numpy/blob/a9e9343ecf5a6564eb09f1d7078c407db2ff7d9e/numpy/lib/function_base.py#L1172

It looks like in 1.13 this was actually performing subtraction regardless of type, so perhaps something just slipped through on that release?

@peterpanmj (Contributor):

Yes, this is a result of np.diff in 1.13. I use np.diff in #20091. Actually, something like the ^ operator can be used instead.
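A short illustration of the substitution (assuming a boolean mask, as in the rank code):

    import numpy as np

    mask = np.array([True, True, False, False, True])
    # np.diff subtracts adjacent elements, which is deprecated for booleans;
    # XOR of shifted slices flags the same "value changed" positions
    changed = mask[1:] ^ mask[:-1]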

Labels: Groupby, Performance (memory or execution speed)

Closes: PERF: cythonize groupby-rank