ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pand…
dsm054 authored and stangirala committed Jun 11, 2017
1 parent 688a329 commit d551ec1
Showing 8 changed files with 338 additions and 63 deletions.
1 change: 1 addition & 0 deletions doc/source/api.rst
@@ -1707,6 +1707,7 @@ Computations / Descriptive Stats
GroupBy.mean
GroupBy.median
GroupBy.min
GroupBy.ngroup
GroupBy.nth
GroupBy.ohlc
GroupBy.prod
63 changes: 57 additions & 6 deletions doc/source/groupby.rst
@@ -1122,12 +1122,36 @@ To see the order in which each row appears within its group, use the

.. ipython:: python
-   df = pd.DataFrame(list('aaabba'), columns=['A'])
-   df
+   dfg = pd.DataFrame(list('aaabba'), columns=['A'])
+   dfg
+   dfg.groupby('A').cumcount()
+   dfg.groupby('A').cumcount(ascending=False)
.. _groupby.ngroup:

Enumerate groups
~~~~~~~~~~~~~~~~

.. versionadded:: 0.20.2

To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use the ``ngroup``
method.

Note that the numbers given to the groups match the order in which the
groups would be seen when iterating over the groupby object, not the
order they are first observed.

.. ipython:: python
-   df.groupby('A').cumcount()
+   dfg = pd.DataFrame(list('aaabba'), columns=['A'])
+   dfg
-   df.groupby('A').cumcount(ascending=False) # kwarg only
+   dfg.groupby('A').ngroup()
+   dfg.groupby('A').ngroup(ascending=False)
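The note that group numbers follow iteration order rather than first-observed order can be illustrated with a small sketch. The frame below is hypothetical (it is not part of the commit), and the contrast relies on the assumption that passing ``sort=False`` to ``groupby`` makes iteration order, and therefore the numbering, follow first appearance.

```python
import pandas as pd

# Hypothetical frame in which 'b' is observed before 'a'.
frame = pd.DataFrame({"A": list("bbaab")})

# Default (sort=True): groups are numbered in sorted key order.
sorted_ids = frame.groupby("A").ngroup().tolist()

# sort=False: groups are numbered in the order they first appear.
first_seen_ids = frame.groupby("A", sort=False).ngroup().tolist()

# Iteration order of the default groupby, for comparison with sorted_ids.
iteration_keys = [key for key, _ in frame.groupby("A")]

print(sorted_ids)
print(first_seen_ids)
print(iteration_keys)
```

With the default settings the key ``'a'`` receives number 0 even though ``'b'`` appears first in the data, which is the behaviour the note describes.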
Plotting
~~~~~~~~
@@ -1176,14 +1200,41 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
df
df.groupby(df.sum(), axis=1).sum()
.. _groupby.multicolumn_factorization:
Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~

By using ``.ngroup()``, we can extract information about the groups in
a way similar to :func:`factorize` (as described further in the
:ref:`reshaping API <reshaping.factorize>`) but which applies
naturally to multiple columns of mixed type and different
sources. This can be useful as an intermediate categorical-like step
in processing, when the relationships between the group rows are more
important than their content, or as input to an algorithm which only
accepts the integer encoding. (For more information about support in
pandas for full categorical data, see the :ref:`Categorical
introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.)

.. ipython:: python
dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
dfg
dfg.groupby(["A", "B"]).ngroup()
dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
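As a rough comparison with :func:`factorize`, the sketch below encodes one column both ways and then encodes two columns jointly with ``ngroup``. The data is hypothetical, and ``sort=True`` is passed to ``pd.factorize`` so that its codes follow sorted order like the default ``groupby``.

```python
import pandas as pd

# Hypothetical two-column frame with mixed types.
frame = pd.DataFrame({"A": [1, 1, 2, 2, 1], "B": list("abaab")})

# Single column: factorize (sorted) and ngroup give the same integer codes.
codes, uniques = pd.factorize(frame["A"], sort=True)
single_ids = frame.groupby("A").ngroup().tolist()

# Two columns: ngroup numbers each distinct (A, B) pair in sorted order.
pair_ids = frame.groupby(["A", "B"]).ngroup().tolist()

print(list(codes), single_ids)
print(pair_ids)
```

Here the pairs ``(1, 'a')``, ``(1, 'b')`` and ``(2, 'a')`` are numbered 0, 1 and 2, so the result can serve as a single integer key for the combination of both columns.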
Groupby by Indexer to 'resample' data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Resampling produces new hypothetical samples(resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
+Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.

In order for resample to work on indices that are non-datetimelike, the following procedure can be used.

-In the following examples, **df.index // 5** returns a binary array which is used to determine what get's selected for the groupby operation.
+In the following examples, **df.index // 5** returns a binary array which is used to determine what gets selected for the groupby operation.

.. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples.
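A minimal sketch of the binning described in the note, using a hypothetical ten-row frame with a default integer index (the frame and its column name are assumptions, not taken from the commit):

```python
import pandas as pd

# Hypothetical data: ten samples on a default integer index 0..9.
samples = pd.DataFrame({"x": range(10)})

# Integer division maps rows 0-4 to bin 0 and rows 5-9 to bin 1.
bin_labels = samples.index // 5

# Aggregate each bin into one value, its standard deviation.
downsampled = samples.groupby(bin_labels).std()

print(list(bin_labels))
print(downsampled)
```

Ten rows are reduced to two, one per bin, which is the downsampling effect the note refers to.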

2 changes: 1 addition & 1 deletion doc/source/reshaping.rst
@@ -636,7 +636,7 @@ When a column contains only one level, it will be omitted in the result.
pd.get_dummies(df, drop_first=True)
.. _reshaping.factorize:

Factorizing values
------------------
5 changes: 5 additions & 0 deletions doc/source/whatsnew/v0.20.2.txt
@@ -23,6 +23,11 @@ Enhancements
- ``Series`` provides a ``to_latex`` method (:issue:`16180`)
- Added :attr:`Index.is_strictly_monotonic_increasing` and :attr:`Index.is_strictly_monotonic_decreasing` properties (:issue:`16515`)

- A new groupby method :meth:`~pandas.core.groupby.GroupBy.ngroup`,
parallel to the existing :meth:`~pandas.core.groupby.GroupBy.cumcount`,
has been added to return the group order (:issue:`11642`); see
:ref:`here <groupby.ngroup>`.
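To make the parallel between the two methods concrete, the sketch below places both results side by side on the frame used in the documentation examples. The column names ``row_in_group`` and ``group_id`` are illustrative only.

```python
import pandas as pd

dfg = pd.DataFrame(list("aaabba"), columns=["A"])
grouped = dfg.groupby("A")

# cumcount numbers rows within each group; ngroup numbers the groups.
summary = dfg.assign(
    row_in_group=grouped.cumcount(),
    group_id=grouped.ngroup(),
)

print(summary)
```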

.. _whatsnew_0202.performance:

Performance Improvements
75 changes: 74 additions & 1 deletion pandas/core/groupby.py
@@ -150,7 +150,7 @@
'last', 'first',
'head', 'tail', 'median',
'mean', 'sum', 'min', 'max',
-'cumcount',
+'cumcount', 'ngroup',
'resample',
'rank', 'quantile',
'fillna',
@@ -1437,6 +1437,75 @@ def nth(self, n, dropna=None):

        return result

    @Substitution(name='groupby')
    @Appender(_doc_template)
    def ngroup(self, ascending=True):
        """
        Number each group from 0 to the number of groups - 1.

        This is the enumerative complement of cumcount. Note that the
        numbers given to the groups match the order in which the groups
        would be seen when iterating over the groupby object, not the
        order they are first observed.

        .. versionadded:: 0.20.2

        Parameters
        ----------
        ascending : bool, default True
            If False, number in reverse, from number of groups - 1 to 0.

        Examples
        --------

        >>> df = pd.DataFrame({"A": list("aaabba")})
        >>> df
           A
        0  a
        1  a
        2  a
        3  b
        4  b
        5  a
        >>> df.groupby('A').ngroup()
        0    0
        1    0
        2    0
        3    1
        4    1
        5    0
        dtype: int64
        >>> df.groupby('A').ngroup(ascending=False)
        0    1
        1    1
        2    1
        3    0
        4    0
        5    1
        dtype: int64
        >>> df.groupby(["A", [1,1,2,3,2,1]]).ngroup()
        0    0
        1    0
        2    1
        3    3
        4    2
        5    0
        dtype: int64

        See also
        --------
        .cumcount : Number the rows in each group.
        """

        self._set_group_selection()

        index = self._selected_obj.index
        result = Series(self.grouper.group_info[0], index)
        if not ascending:
            result = self.ngroups - 1 - result
        return result
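The reverse numbering above is the forward numbering subtracted from ``ngroups - 1``. A small sketch using only the public API (the variable names are illustrative) checks that relationship on the docstring's example frame:

```python
import pandas as pd

dfg = pd.DataFrame({"A": list("aaabba")})
grouped = dfg.groupby("A")

forward = grouped.ngroup()
reverse = grouped.ngroup(ascending=False)

# Recompute the reverse numbering by hand from the forward one.
manual_reverse = grouped.ngroups - 1 - forward

print(reverse.tolist())
print(manual_reverse.tolist())
```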

    @Substitution(name='groupby')
    @Appender(_doc_template)
    def cumcount(self, ascending=True):
@@ -1481,6 +1550,10 @@ def cumcount(self, ascending=True):
        4    0
        5    0
        dtype: int64

        See also
        --------
        .ngroup : Number the groups themselves.
        """

        self._set_group_selection()