ENH: Add groupby().ngroup() method to count groups (#11642) #14026

dsm054 · 2016-08-18T01:44:22Z

This basically adds a method to give access to the coding returned by grouper.group_info[0], i.e. the number of the group that each row is in. This is a natural parallel to cumcount(), and while it's not the world's most important feature it's come in handy for me from time to time and deserves a public method, IMHO.
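
For readers skimming the thread, here is a minimal sketch of the parallel described above, using the final method name `ngroup` (the name settled on later in this conversation). The frame is illustrative only; it is not taken from the PR's tests.

```python
import pandas as pd

# Illustrative data: two groups, 'a' and 'b'.
df = pd.DataFrame({"A": list("aaabba")})
g = df.groupby("A")

group_number = g.ngroup()    # which group each row belongs to
within_group = g.cumcount()  # position of each row inside its own group

print(pd.DataFrame({"A": df["A"],
                    "ngroup": group_number,
                    "cumcount": within_group}))
```

Both results are indexed like the original frame, so they can be assigned back as columns.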

closes ENH: enumerate groups #11642
9 tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2016-08-18T01:47:56Z

pandas/core/groupby.py

+        Number each group from 0 to the number of groups - 1.
+
+        This is the enumerative complement of cumcount.  Note that the
+        numbers given to the groups match the order in which the groups


Add a versionadded tag.

chris-b1 · 2016-08-18T02:28:25Z

I think you can add this function here, so that df.groupby(...).transform('enumerate') also works. Probably the same with cumcount.

jreback · 2016-08-18T10:52:19Z

doc/source/groupby.rst

@@ -969,7 +969,7 @@ Enumerate group items
 .. versionadded:: 0.13.0

 To see the order in which each row appears within its group, use the
-``cumcount`` method:
+``cumcount`` method (compare with ``enumerate``):


not sure what this means

sinhrks · 2016-08-25T21:45:02Z

Isn't the name enumerate confusing? I expected something related to iteration.

jreback · 2016-09-09T22:32:31Z

can you rebase / update?

jreback · 2016-11-16T00:54:41Z

looking back on this, shouldn't we call this g.factorize()?

dsm054 · 2016-11-16T16:20:22Z

@jreback: I'll buy that. The collision with enumerate was always a bit unfortunate.

jorisvandenbossche · 2016-11-16T20:15:09Z

I agree enumerate can be confusing given that it is not in an iteration context here, but I find factorize also not ideal (given its association with categoricals).

The verb is also 'number'. But that is maybe also a bit strange to use given the use of number as a noun? According to google translate/oxford dictionary:

number: Mark with a number or assign a number to, typically to indicate position in a series:

Synonyms: assign a number to, categorize by number, specify by number, mark with a number, itemize, enumerate

So this points back to enumerate ..

jreback · 2016-11-16T20:56:43Z

how about .categorize()? or .group_number()?

max-sixty · 2016-11-16T22:24:34Z

pandas/core/groupby.py

+        4    0
+        5    1
+        dtype: int64
+        >>> df = pd.DataFrame([['b'], ['a'], ['a'], ['b']], columns=['A'])


Is this a canonical way to create the df? I'd generally have used pd.DataFrame({'A': list('baab')})

Or without the list expansion is fine too
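
A small sketch of the two constructions being compared, assuming only that the column values are what matters for the docstring example. The variable names are made up for illustration.

```python
import pandas as pd

# Construction used in the docstring under review: a list of one-element rows.
rows_style = pd.DataFrame([['b'], ['a'], ['a'], ['b']], columns=['A'])

# Construction suggested in the comment: a dict of column -> values.
dict_style = pd.DataFrame({'A': list('baab')})

# Same column values either way, and therefore the same group numbers.
same_values = rows_style['A'].tolist() == dict_style['A'].tolist()
same_numbers = rows_style.groupby('A').ngroup().equals(
    dict_style.groupby('A').ngroup()
)
print(same_values, same_numbers)
```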

jreback · 2016-12-21T23:14:28Z

any more thoughts on the name? does SQL call this anything?

summary (and some new)

.factorize()
.enumerate()
.categorize()
.group_number()
.ngroup()
.number()

jreback · 2017-01-21T23:18:00Z

any further thoughts on this?

jreback · 2017-03-20T13:55:28Z

@dsm054 thoughts on this?

dsm054 · 2017-03-21T16:22:09Z

Looking back I still think that enumerate was a poor choice of name because we're not returning an iterable of tuples, and factorize and categorize suggest contracts that I don't think we necessarily want to commit to (namely, that the numbers and/or types will match those that pandas would give otherwise).

group_number is explicit, at least, and ngroup has a certain parallel with ngroups which is kind of appealing. number feels too open-ended to me.

In SQL I'd use something like DENSE_RANK() to get this.
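
To make the DENSE_RANK() comparison concrete, here is a rough sketch for the simplest case: a single numeric key with the default sorted grouping. It is an analogy, not a claim that the two are implemented the same way; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [30, 10, 10, 20, 30]})

# Group numbers follow the sorted order of the keys: 10 -> 0, 20 -> 1, 30 -> 2.
group_number = df.groupby("A").ngroup()

# A dense rank of the key starts at 1, so shift it down to start at 0.
dense_rank = df["A"].rank(method="dense").astype("int64") - 1

print(group_number.tolist())
print(dense_rank.tolist())
```

For this case the two columns agree; with several keys or unsorted grouping the analogy needs more care.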

jreback · 2017-03-21T16:32:40Z

I like .ngroup()

Closes pandas-dev#11642

codecov · 2017-03-22T04:02:05Z

Codecov Report

No coverage uploaded for pull request base (master@fb47ee5).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #14026   +/-   ##
=========================================
  Coverage          ?   90.76%           
=========================================
  Files             ?      161           
  Lines             ?    51098           
  Branches          ?        0           
=========================================
  Hits              ?    46377           
  Misses            ?     4721           
  Partials          ?        0

Flag        Coverage Δ
#multiple   88.6% <100%> (?)
#single     40.16% <33.33%> (?)

Impacted Files            Coverage Δ
pandas/core/groupby.py    92.07% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fb47ee5...14d941b.

jorisvandenbossche · 2017-03-22T08:45:08Z

I agree that factorize and categorize are not ideal (for me due to the link to categoricals, and the result here is not a categorical).
I first didn't like ngroup, as it looked too much like the existing ngroups, and it's not the same. But maybe that is not bad after all. group_number would be the most explicit. But OK with ngroup I think.

jorisvandenbossche

Looks good! (nice suite of tests!)

jorisvandenbossche · 2017-03-22T08:27:00Z

doc/source/groupby.rst

+
+   df.groupby('A').ngroup()
+
+   df.groupby('A').ngroup(ascending=False)  # kwarg only


You can remove the "kwarg only" I think. It's not really clear to what it points, and it is also not correct I think

Ah, this was a copy/paste from the cumcount doc which I didn't remove. I don't think it holds for that either, so I'll remove them both.
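
For reference, a minimal sketch of what `ascending=False` does, using the same frame as the doc example. The relationship to `ngroups` shown in the comment is an observation about this example, not a quote from the docstring.

```python
import pandas as pd

df = pd.DataFrame(list('aaabba'), columns=['A'])
g = df.groupby('A')

forward = g.ngroup()                  # 'a' -> 0, 'b' -> 1
backward = g.ngroup(ascending=False)  # numbering reversed: 'a' -> 1, 'b' -> 0

# In this example the reversed numbering is (number of groups - 1) - forward.
mirrored = g.ngroups - 1 - forward

print(forward.tolist())
print(backward.tolist())
```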

jorisvandenbossche · 2017-03-22T08:35:09Z

doc/source/groupby.rst

+way similar to ``pd.factorize()``, but which applies naturally to multiple
+columns of mixed type and different sources:
+
+.. ipython::python


space after the :: needed

jorisvandenbossche · 2017-03-22T08:42:22Z

pandas/tests/groupby/test_groupby.py

+        assert_series_equal(g_ngroup, expected_ngroup)
+        assert_series_equal(g_cumcount, expected_cumcount)
+
+    def test_ngroup_cumcount_pair(self):


Are you testing here something specific? Or just to have more test cases?

If you mean test_ngroup_matches_cumcount, yeah, it's just a specific two-column case to make sure they align. If you mean test_ngroup_cumcount_pair, that's to make sure that the (ngroup, cumcount) pair you can assign to each row matches expectation.

I meant the test_ngroup_cumcount_pair. It is just not really clear to me what the test specifically tries to test in addition to the other tests
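
The idea behind the pair check can be sketched as follows. This is not the PR's actual test; `check_ngroup_cumcount_pair` and the sample frame are hypothetical, written only to show what "the (ngroup, cumcount) pair matches expectation" could mean.

```python
import pandas as pd


def check_ngroup_cumcount_pair(df, keys):
    """Hypothetical helper: verify (ngroup, cumcount) against plain iteration."""
    g = df.groupby(keys)
    pairs = list(zip(g.ngroup(), g.cumcount()))

    # Every row should get a distinct (group number, position) pair.
    assert len(set(pairs)) == len(df)

    # Rebuild the expected pairs by walking the groups in iteration order.
    expected = {}
    for group_no, (_, frame) in enumerate(g):
        for position, row_label in enumerate(frame.index):
            expected[row_label] = (group_no, position)

    assert pairs == [expected[label] for label in df.index]
    return pairs


df = pd.DataFrame({"A": list("baab"), "B": [1, 1, 2, 1]})
single_key_pairs = check_ngroup_cumcount_pair(df, "A")
two_key_pairs = check_ngroup_cumcount_pair(df, ["A", "B"])
```

The sketch assumes a unique index, since it looks rows up by label.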

jorisvandenbossche · 2017-03-22T08:49:40Z

Just to be sure we have considered this: we have the question whether this should behave similar to cumcount (transformer) or rather similar to size (reducer)?
But I suppose it is mainly the transforming case that is useful?

In the second case we could always have both behaviours as .ngroup() vs .transform('ngroup')
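
A short sketch of the transformer/reducer distinction raised here, comparing the shapes of `ngroup`, `cumcount`, and `size` on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"A": list("aaabba")})
g = df.groupby("A")

per_row_number = g.ngroup()      # transformer-like: one value per original row
per_row_position = g.cumcount()  # transformer-like: one value per original row
per_group_size = g.size()        # reducer: one value per group

print(len(per_row_number), len(per_row_position), len(per_group_size))
```

As merged, `ngroup` follows the `cumcount` shape: the result is aligned with the rows of the original object.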

jreback

lgtm. some doc comments. ping when ready.

jreback · 2017-03-22T13:29:19Z

doc/source/groupby.rst

+
+.. ipython:: python
+
+   df = pd.DataFrame(list('aaabba'), columns=['A'])


name this df something else as it might be clobbering the other parts of the docs (e.g. dfg or something)

alternatively, maybe better is to use the same example as cumcount (change if needed to make it easier to talk about), and then you can contrast it

It's already the same as the cumcount example, I think.

oh ok. Reading this as a user, I want to know: what is the difference between these, and when should I use one over the other? (You explain this in the doc-string, so just add it somewhere here.)

jreback · 2017-03-22T13:29:51Z

doc/source/groupby.rst

+.. versionadded:: 0.20.0
+
+To see the ordering of the groups themselves, you can use the ``ngroup``
+method:


I would add a comment differentiating this from .cumcount (when to use which)

jreback · 2017-03-22T13:31:11Z

doc/source/groupby.rst

+.. ipython::python
+
+    df = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
+


can you elaborate on why this is useful
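
One way to illustrate the "multiple columns of mixed type and different sources" point is sketched below. The frame is the one from the diff; the external `flag` Series is a hypothetical addition used only to show a key that does not live in the frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})

# Two columns of different types: each distinct (A, B) combination gets a number.
by_columns = df.groupby(["A", "B"]).ngroup()

# A key from a different source: an external boolean Series aligned on the index
# (hypothetical, not part of the PR's documentation example).
flag = pd.Series([True, False, True, True, False])
mixed_sources = df.groupby(["B", flag]).ngroup()

print(by_columns.tolist())
print(mixed_sources.tolist())
```

The result is a single integer label per row for the combined key, without building the combined key by hand first.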

jreback · 2017-03-22T13:31:32Z

doc/source/whatsnew/v0.20.0.txt

+Groupby Group Numbers
+^^^^^^^^^^^^^^^^^^^^^
+
+A new groupby method ``ngroup``, parallel to the existing ``cumcount``, has been added to return the group order (:issue:`11642`).


.ngroup() and .cumcount()

or even better: link to their docstring page, using :meth:`~pandas.core.groupby.GroupBy.ngroup` (will only display the ngroup)

I think the grammar actually works better without the parentheses -- type(df.groupby("a").ngroup) is what gives "method", after all. Should I do this to every mention of ngroup in this doc?

@dsm054 usually do this in the first / most prominent mention (so here is obviously good). you certainly can do it more. I would also add the ref for cumcount here.

jreback · 2017-03-22T13:31:47Z

doc/source/whatsnew/v0.20.0.txt

+^^^^^^^^^^^^^^^^^^^^^
+
+A new groupby method ``ngroup``, parallel to the existing ``cumcount``, has been added to return the group order (:issue:`11642`).
+


add a ref back to the docs

jreback · 2017-03-22T13:32:15Z

pandas/core/groupby.py

+        This is the enumerative complement of cumcount.  Note that the
+        numbers given to the groups match the order in which the groups
+        would be seen when iterating over the groupby object, not the
+        order they are first observed.


this comment is good, add something like this to the docs (in groupby.rst) where you show an example
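
A sketch of the ordering note quoted above, contrasting iteration order with first-observed order. The comparison with `sort=False` and `pd.factorize` is added here for illustration and is not part of the docstring.

```python
import pandas as pd

df = pd.DataFrame({"A": list("baab")})

# Default grouping sorts the keys, so iteration yields 'a' before 'b'.
iteration_order = [key for key, _ in df.groupby("A")]

sorted_numbers = df.groupby("A").ngroup()                  # matches iteration order
first_seen_numbers = df.groupby("A", sort=False).ngroup()  # matches first appearance
factorize_codes, _ = pd.factorize(df["A"])                 # also first appearance

print(iteration_order)
print(sorted_numbers.tolist())
print(first_seen_numbers.tolist())
```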

jreback · 2017-03-22T13:32:39Z

pandas/core/groupby.py

+        4    0
+        5    1
+        dtype: int64
+        """


can you add an example with multiple groupers

jreback · 2017-03-22T13:33:55Z

pandas/tests/groupby/test_groupby.py

@@ -4304,6 +4251,192 @@ def test_cummin_cummax(self):
            tm.assert_series_equal(expected, result)


+class TestCounting(tm.TestCase):


perfect you moved them! even better, can you move to a separate file

pandas/tests/groupby/test_counting.py (cumcount & ngroup). If you find other methods that are very relevant ok (note I wouldn't move .count()).

jreback

minor doc comments. lgtm otherwise. ping on green.

jreback · 2017-05-31T13:59:01Z

doc/source/groupby.rst

+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By using ``.ngroup()``, we can extract information about the groups in a
+way similar to ``pd.factorize()``, but which applies naturally to multiple


can you make this :func:`factorize` and point a ref link to the factorize section of docs (in reshaping IIRC)

jreback · 2017-05-31T14:00:16Z

doc/source/whatsnew/v0.20.2.txt

@@ -21,6 +21,7 @@ Enhancements

 - Unblocked access to additional compression types supported in pytables: 'blosc:blosclz, 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd' (:issue:`14478`)
 - ``Series`` provides a ``to_latex`` method (:issue:`16180`)
+- A new groupby method :meth:`~pandas.core.groupby.GroupBy.ngroup`, parallel to the existing :meth:`~pandas.core.groupby.GroupBy.cumcount`, has been added to return the group order (:issue:`11642`).


add a ref to the docs you just created here

jreback · 2017-06-01T10:42:21Z

@dsm054 if you can update docs as indicated and rebase today would be great.

dsm054 · 2017-06-01T14:27:40Z

@jreback: had to rebase this morning 'cause of that linting error you fixed. I'll keep doing so until we're green.

dsm054 · 2017-06-01T16:55:21Z

test_write_fspath_all[to_hdf-writer_kwargs3-tables] failed, which doesn't seem related.

TomAugspurger · 2017-06-01T18:05:23Z

test_write_fspath_all[to_hdf-writer_kwargs3-tables] failed, which doesn't seem related.

yeah, I didn't realize that test was flaky, but I am able to reproduce it occasionally.

jreback · 2017-06-01T22:12:22Z

thanks @dsm054 nice PR!

Version 0.20.2 * tag 'v0.20.2': (68 commits) RLS: v0.20.2 DOC: Update release.rst DOC: Whatsnew fixups (pandas-dev#16596) ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460) BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444) PERF: vectorize _interp_limit (pandas-dev#16592) DOC: whatsnew 0.20.2 edits (pandas-dev#16587) API: Make is_strictly_monotonic_* private (pandas-dev#16576) BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565) Strictly monotonic (pandas-dev#16555) ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026) fix linting BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244) BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317) return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486) BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549) BUG: Fixed tput output on windows (pandas-dev#16496) Strictly monotonic (pandas-dev#16555) BUG: fixed wrong order of ordered labels in pd.cut() BUG: Fixed to_html ignoring index_names parameter ...

jreback reviewed Aug 18, 2016

jreback added Enhancement Groupby labels Aug 18, 2016

jreback reviewed Aug 18, 2016

max-sixty reviewed Nov 16, 2016

dsm054 force-pushed the feature/group_enumerate branch from a6e60a7 to 7aee071 Compare March 22, 2017 02:24

dsm054 added a commit to dsm054/pandas that referenced this pull request Mar 22, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

7aee071

Closes pandas-dev#11642

dsm054 force-pushed the feature/group_enumerate branch from 7aee071 to 966f9be Compare March 22, 2017 04:01

dsm054 added a commit to dsm054/pandas that referenced this pull request Mar 22, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

966f9be

Closes pandas-dev#11642

jorisvandenbossche reviewed Mar 22, 2017

jreback changed the title ~~ENH: Add groupby().enumerate method to count groups (#11642)~~ ENH: Add groupby().ngroup() method to count groups (#11642) Mar 22, 2017

jreback approved these changes Mar 22, 2017

dsm054 added a commit to dsm054/pandas that referenced this pull request May 29, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

b9b484e

dsm054 force-pushed the feature/group_enumerate branch from b9b484e to 053935b Compare May 29, 2017 03:12

dsm054 added a commit to dsm054/pandas that referenced this pull request May 29, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

053935b

jreback approved these changes May 31, 2017

jreback added this to the 0.20.2 milestone May 31, 2017

dsm054 force-pushed the feature/group_enumerate branch from 053935b to 5bb1551 Compare June 1, 2017 02:28

dsm054 added a commit to dsm054/pandas that referenced this pull request Jun 1, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

5bb1551

dsm054 force-pushed the feature/group_enumerate branch from 5bb1551 to 7d3dd0c Compare June 1, 2017 03:59

dsm054 added a commit to dsm054/pandas that referenced this pull request Jun 1, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

7d3dd0c

dsm054 force-pushed the feature/group_enumerate branch from 7d3dd0c to 73f0d6a Compare June 1, 2017 04:07

dsm054 added a commit to dsm054/pandas that referenced this pull request Jun 1, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

73f0d6a

dsm054 force-pushed the feature/group_enumerate branch from 73f0d6a to 8383c54 Compare June 1, 2017 11:31

dsm054 added a commit to dsm054/pandas that referenced this pull request Jun 1, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

8383c54

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

14d941b

dsm054 force-pushed the feature/group_enumerate branch from 8383c54 to 14d941b Compare June 1, 2017 12:01

TomAugspurger approved these changes Jun 1, 2017

TomAugspurger mentioned this pull request Jun 1, 2017

TST: Make HDF5 fspath write test robust #16575

Merged

jreback added the Needs Backport label Jun 1, 2017

jreback merged commit 72e0d1f into pandas-dev:master Jun 1, 2017

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 2, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pand…

8284932

…as-dev#14026) (cherry picked from commit 72e0d1f)

TomAugspurger pushed a commit that referenced this pull request Jun 4, 2017

ENH: add .ngroup() method to groupby objects (#14026) (#14026)

8c7ddbe

(cherry picked from commit 72e0d1f)

TomAugspurger removed the Needs Backport label Jun 4, 2017

Kiv pushed a commit to Kiv/pandas that referenced this pull request Jun 11, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pand…

b8ca9fc

…as-dev#14026)

stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pand…

d551ec1

…as-dev#14026)
