
ENH: Add groupby().ngroup() method to count groups (#11642) #14026

Merged
merged 1 commit into from
Jun 1, 2017

@dsm054
Contributor

dsm054 commented Aug 18, 2016

This basically adds a method to give access to the coding returned by grouper.group_info[0], i.e. the number of the group that each row is in. This is a natural parallel to cumcount(), and while it's not the world's most important feature it's come in handy for me from time to time and deserves a public method, IMHO.
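
To illustrate the parallel with cumcount() described above, here is a minimal sketch using the method's final name, ngroup. The frame and the column name "A" are made up for this example and are not taken from the PR.

```python
import pandas as pd

# Hypothetical example data, not taken from the PR.
df = pd.DataFrame({"A": list("aaabba")})
g = df.groupby("A")

# ngroup: the number of the group each row belongs to
# (groups are numbered in sorted key order by default).
group_number = g.ngroup()

# cumcount: the position of each row within its own group.
within_group = g.cumcount()

print(pd.DataFrame({"A": df["A"],
                    "ngroup": group_number,
                    "cumcount": within_group}))
```

Both results are indexed like the original rows, so either can be assigned back as a new column.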

Number each group from 0 to the number of groups - 1.

This is the enumerative complement of cumcount. Note that the
numbers given to the groups match the order in which the groups
Contributor

versionadded tag

@chris-b1
Contributor

chris-b1 commented Aug 18, 2016

I think you can add this function here, so that df.groupby(...).transform('enumerate') also works. Probably the same with cumcount.

@@ -969,7 +969,7 @@ Enumerate group items
.. versionadded:: 0.13.0

To see the order in which each row appears within its group, use the
-``cumcount`` method:
+``cumcount`` method (compare with ``enumerate``):
Contributor

not sure what this means

@sinhrks
Member

sinhrks commented Aug 25, 2016

Isn't the name enumerate confusing? I expected something related to iteration.

@jreback
Contributor

jreback commented Sep 9, 2016

can you rebase / update?

@jreback
Contributor

jreback commented Nov 16, 2016

looking back on this, shouldn't we call this g.factorize()?

@dsm054
Contributor Author

dsm054 commented Nov 16, 2016

@jreback: I'll buy that. The collision with enumerate was always a bit unfortunate.

@jorisvandenbossche
Member

I agree enumerate can be confusing given that it is not in an iteration context here, but I find factorize also not ideal (given its notion of categoricals).

The verb is also 'number'. But that is maybe also a bit strange to use given the use of number as a noun? According to google translate/oxford dictionary:

number: Mark with a number or assign a number to, typically to indicate position in a series:

Synonyms: assign a number to, categorize by number, specify by number, mark with a number, itemize, enumerate

So this points back to enumerate ..

@jreback
Contributor

jreback commented Nov 16, 2016

how about .categorize()? or .group_number()?

4 0
5 1
dtype: int64
>>> df = pd.DataFrame([['b'], ['a'], ['a'], ['b']], columns=['A'])
Contributor

Is this a canonical way to create the df? I'd generally have used pd.DataFrame({'A': list('baab')})

Contributor

Or without the list expansion is fine too
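
As a quick check of the two constructions being compared (a sketch only; the variable names df_rows and df_dict are introduced here for illustration), both spellings appear to produce the same frame:

```python
import pandas as pd

# Construction used in the docstring under review: a list of one-element rows.
df_rows = pd.DataFrame([["b"], ["a"], ["a"], ["b"]], columns=["A"])

# Construction suggested by the reviewer: a dict mapping column name to values.
df_dict = pd.DataFrame({"A": list("baab")})

# DataFrame.equals compares shape, dtypes, labels and values.
same = df_rows.equals(df_dict)
print(same)
```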

@jreback
Contributor

jreback commented Dec 21, 2016

any more thoughts on the name? does SQL call this anything?

summary (and some new)

  • .factorize()
  • .enumerate()
  • .categorize()
  • .group_number()
  • .ngroup()
  • .number()

@jreback
Contributor

jreback commented Jan 21, 2017

any further thoughts on this?

@jreback
Contributor

jreback commented Mar 20, 2017

@dsm054 thoughts on this?

@dsm054
Contributor Author

dsm054 commented Mar 21, 2017

Looking back I still think that enumerate was a poor choice of name because we're not returning an iterable of tuples, and factorize and categorize suggest contracts that I don't think we necessarily want to commit to (namely, that the numbers and/or types will match those that pandas would give otherwise).

group_number is explicit, at least, and ngroup has a certain parallel with ngroups which is kind of appealing. number feels too open-ended to me.

In SQL I'd use something like DENSE_RANK() to get this.
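
To make the DENSE_RANK() comparison concrete, here is a minimal pandas sketch. It assumes a single sortable numeric key and the default sort=True; the data are invented for the example. Under those assumptions the group number should equal the dense rank of the key minus one.

```python
import pandas as pd

# Hypothetical data with one numeric grouping key.
df = pd.DataFrame({"A": [30, 10, 10, 20, 30]})

# Group number per row (groups numbered in sorted key order).
group_number = df.groupby("A").ngroup()

# Dense rank of the key is 1-based, so subtract 1 to compare.
dense = df["A"].rank(method="dense").astype("int64") - 1

print(pd.DataFrame({"A": df["A"], "ngroup": group_number, "dense_rank-1": dense}))
```

With several grouping keys there is no single column to rank, which is one reason a dedicated groupby method is convenient.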

@jreback
Contributor

jreback commented Mar 21, 2017

I like .ngroup()

dsm054 added a commit to dsm054/pandas that referenced this pull request Mar 22, 2017
@codecov

codecov bot commented Mar 22, 2017

Codecov Report

No coverage uploaded for pull request base (master@fb47ee5).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #14026   +/-   ##
=========================================
  Coverage          ?   90.76%
=========================================
  Files             ?      161
  Lines             ?    51098
  Branches          ?        0
=========================================
  Hits              ?    46377
  Misses            ?     4721
  Partials          ?        0

Flag        Coverage Δ
#multiple   88.6% <100%> (?)
#single     40.16% <33.33%> (?)

Impacted Files            Coverage Δ
pandas/core/groupby.py    92.07% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fb47ee5...14d941b.

@jorisvandenbossche
Member

I agree that factorize and categorize are not ideal (for me due to the link to categoricals, and the result here is not a categorical).
At first I didn't like ngroup, as it looked too much like the existing ngroups, and it's not the same. But maybe that is not bad after all. group_number would be the most explicit. But OK with ngroup I think.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good! (nice suite of tests!)


df.groupby('A').ngroup()

df.groupby('A').ngroup(ascending=False) # kwarg only
Member

You can remove the "kwarg only" I think. It's not really clear to what it points, and it is also not correct I think

Contributor Author

Ah, this was a copy/paste from the cumcount doc which I didn't remove. I don't think it holds for that either, so I'll remove them both.
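
For reference, a small sketch of what the ascending flag in the snippet under review does. The data are made up; the relation between the two numberings is checked here only for this example.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"A": list("aaabba")})
g = df.groupby("A")

# Default numbering: 0 .. ngroups - 1 in group order.
forward = g.ngroup()

# Reversed numbering: ngroups - 1 .. 0.
backward = g.ngroup(ascending=False)

print(pd.DataFrame({"A": df["A"], "forward": forward, "backward": backward}))
```

In this example the two numberings sum to g.ngroups - 1 on every row.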

way similar to ``pd.factorize()``, but which applies naturally to multiple
columns of mixed type and different sources:

.. ipython::python
Member

space after the :: needed

assert_series_equal(g_ngroup, expected_ngroup)
assert_series_equal(g_cumcount, expected_cumcount)

def test_ngroup_cumcount_pair(self):
Member

Are you testing here something specific? Or just to have more test cases?

Contributor Author

If you mean test_ngroup_matches_cumcount, yeah, it's just a specific two-column case to make sure they align. If you mean test_ngroup_cumcount_pair, that's to make sure that the (ngroup, cumcount) pair you can assign to each row matches expectation.

Member

I meant the test_ngroup_cumcount_pair. It is just not really clear to me what the test specifically tries to test in addition to the other tests
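
The idea behind the (ngroup, cumcount) pair can be sketched as follows. This is not the test from the PR, only an illustration on invented data: each row gets a (group number, position within group) pair, and the pairs should be unique across rows.

```python
import pandas as pd

# Hypothetical example data, not the data used in the PR's test.
df = pd.DataFrame({"A": list("aaabba")})
g = df.groupby("A")

# Pair each row's group number with its position inside that group.
pairs = list(zip(g.ngroup(), g.cumcount()))

print(pairs)
```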

@jorisvandenbossche
Member

Just to be sure we have considered this: we have the question whether this should behave similar to cumcount (transformer) or rather similar to size (reducer)?
But I suppose it is mainly the transforming case that is useful?

In the second case we could always have both behaviours as .ngroup() vs .transform('ngroup')
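
The transformer/reducer distinction raised above can be illustrated with a short sketch on made-up data: ngroup() returns one value per row, like cumcount(), while size() returns one value per group, and the ngroups attribute is a single integer.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"A": list("aaabba")})
g = df.groupby("A")

# Transformer-like: one value per original row.
per_row = g.ngroup()

# Reducer: one value per group.
per_group = g.size()

# Not to be confused with ngroup(): the total number of groups.
n_groups = g.ngroups

print(len(per_row), len(per_group), n_groups)
```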

@jreback jreback changed the title from "ENH: Add groupby().enumerate method to count groups (#11642)" to "ENH: Add groupby().ngroup() method to count groups (#11642)" on Mar 22, 2017
Contributor

@jreback jreback left a comment

lgtm. some doc comments. ping when ready.


.. ipython:: python

df = pd.DataFrame(list('aaabba'), columns=['A'])
Contributor

name this df something else as it might be clobbering the other parts of the docs (e.g. dfg or something)

Contributor

alternatively, maybe better is to use the same example as cumcount (change if needed to make it easier to talk about), and then you can contrast it

Contributor Author

It's already the same as the cumcount example, I think.

Contributor

oh ok, just reading as a user want to know: what is the difference between these and when should I use one over other (which you explain in the doc-string, just add somewhere here)

.. versionadded:: 0.20.0

To see the ordering of the groups themselves, you can use the ``ngroup``
method:
Contributor

I would add a comment differentiating this from .cumcount (when to use which)

.. ipython::python

df = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})

Contributor

can you elaborate on why this is useful

Groupby Group Numbers
^^^^^^^^^^^^^^^^^^^^^

A new groupby method ``ngroup``, parallel to the existing ``cumcount``, has been added to return the group order (:issue:`11642`).
Contributor

.ngroup() and .cumcount()

Member

or even better: link to their docstring page, using :meth:`~pandas.core.groupby.GroupBy.ngroup` (will only display the ngroup)

Contributor Author

I think the grammar actually works better without the parentheses -- type(df.groupby("a").ngroup) is what gives "method", after all. Should I do this to every mention of ngroup in this doc?

Contributor

@dsm054 usually do this in the first / most prominent mention (so here is obviously good). you certainly can do it more. I would also add the ref for cumcount here.

^^^^^^^^^^^^^^^^^^^^^

A new groupby method ``ngroup``, parallel to the existing ``cumcount``, has been added to return the group order (:issue:`11642`).

Contributor

add a ref back to the docs

This is the enumerative complement of cumcount. Note that the
numbers given to the groups match the order in which the groups
would be seen when iterating over the groupby object, not the
order they are first observed.
Contributor

this comment is good, add something like this to the docs (in groupby.rst) where you show an example
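
The docstring note about ordering can be illustrated with a small sketch on invented data. With the default sort=True the numbers follow the order seen when iterating over the groupby object; pd.factorize and groupby(..., sort=False) are shown only for contrast, since they number keys in order of first appearance.

```python
import pandas as pd

# Hypothetical data in which the first key seen ("b") is not the first in sorted order.
df = pd.DataFrame({"A": list("baab")})

# Default: numbers follow iteration order over the groupby (sorted keys).
sorted_numbers = df.groupby("A").ngroup()
iteration_keys = [key for key, _ in df.groupby("A")]

# For contrast: numbering by order of first appearance.
first_seen_codes, uniques = pd.factorize(df["A"])
unsorted_numbers = df.groupby("A", sort=False).ngroup()

print(iteration_keys, sorted_numbers.tolist(), unsorted_numbers.tolist())
```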

4 0
5 1
dtype: int64
"""
Contributor

can you add an example with multiple groupers
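
A sketch of what a multiple-grouper example might look like, reusing the frame from the docs snippet quoted earlier in this review. The external Series named extra is hypothetical, added only to show a grouper that does not come from the frame itself.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})

# Two columns of different types as groupers: each distinct (A, B)
# combination gets its own number, in sorted order of the key tuples.
by_columns = df.groupby(["A", "B"]).ngroup()

# A grouper from a different source: a column name plus an external Series
# (hypothetical data) aligned on the same index.
extra = pd.Series(list("xxyxy"))
by_mixed = df.groupby(["A", extra]).ngroup()

print(by_columns.tolist(), by_mixed.tolist())
```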

@@ -4304,6 +4251,192 @@ def test_cummin_cummax(self):
tm.assert_series_equal(expected, result)


class TestCounting(tm.TestCase):
Contributor

perfect you moved them! even better, can you move to a separate file

pandas/tests/groupby/test_counting.py (cumcount & ngroup). If you find other methods that are very relevant ok (note I wouldn't move .count()).

dsm054 added a commit to dsm054/pandas that referenced this pull request May 29, 2017
Contributor

@jreback jreback left a comment

minor doc comments. lgtm otherwise. ping on green.

~~~~~~~~~~~~~~~~~~~~~~~~~~

By using ``.ngroup()``, we can extract information about the groups in a
way similar to ``pd.factorize()``, but which applies naturally to multiple
Contributor

can you make this :func:`factorize` and point a ref link to the factorize section of docs (in reshaping IIRC)

@@ -21,6 +21,7 @@ Enhancements

- Unblocked access to additional compression types supported in pytables: 'blosc:blosclz, 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd' (:issue:`14478`)
- ``Series`` provides a ``to_latex`` method (:issue:`16180`)
- A new groupby method :meth:`~pandas.core.groupby.GroupBy.ngroup`, parallel to the existing :meth:`~pandas.core.groupby.GroupBy.cumcount`, has been added to return the group order (:issue:`11642`).
Contributor

add a ref to the docs you just created here

@jreback jreback added this to the 0.20.2 milestone May 31, 2017
dsm054 added a commit to dsm054/pandas that referenced this pull request Jun 1, 2017
@jreback
Contributor

jreback commented Jun 1, 2017

@dsm054 if you can update docs as indicated and rebase today would be great.

dsm054 added a commit to dsm054/pandas that referenced this pull request Jun 1, 2017
@dsm054
Contributor Author

dsm054 commented Jun 1, 2017

@jreback: had to rebase this morning 'cause of that linting error you fixed. I'll keep doing so until we're green.

@dsm054
Contributor Author

dsm054 commented Jun 1, 2017

test_write_fspath_all[to_hdf-writer_kwargs3-tables] failed, which doesn't seem related.

@TomAugspurger
Contributor

> test_write_fspath_all[to_hdf-writer_kwargs3-tables] failed, which doesn't seem related.

yeah, I didn't realize that test was flaky, but I am able to reproduce it occasionally.

@jreback jreback merged commit 72e0d1f into pandas-dev:master Jun 1, 2017
@jreback
Contributor

jreback commented Jun 1, 2017

thanks @dsm054 nice PR!

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 2, 2017
TomAugspurger pushed a commit that referenced this pull request Jun 4, 2017
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jul 12, 2017
Version 0.20.2

* tag 'v0.20.2': (68 commits)
  RLS: v0.20.2
  DOC: Update release.rst
  DOC: Whatsnew fixups (pandas-dev#16596)
  ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460)
  BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444)
  PERF: vectorize _interp_limit (pandas-dev#16592)
  DOC: whatsnew 0.20.2 edits (pandas-dev#16587)
  API: Make is_strictly_monotonic_* private (pandas-dev#16576)
  BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)
  Strictly monotonic (pandas-dev#16555)
  ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026)
  fix linting
  BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244)
  BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317)
  return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486)
  BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549)
  BUG: Fixed tput output on windows (pandas-dev#16496)
  Strictly monotonic (pandas-dev#16555)
  BUG: fixed wrong order of ordered labels in pd.cut()
  BUG: Fixed to_html ignoring index_names parameter
  ...
Development

Successfully merging this pull request may close these issues.

ENH: enumerate groups
7 participants