ENH: Allow the groupby by param to handle columns and index levels #2636

jonmmease · 2017-08-29T14:49:13Z

Implements the changes proposed in ENH: Allow the groupby by param to handle columns and index levels #2635.
Tests added and passed
Documentation updated
Release notes added

mrocklin

A couple of small comments. I suspect that @TomAugspurger might want to take a look here

mrocklin · 2017-08-29T15:18:39Z

dask/dataframe/tests/test_groupby.py

+
+    # Compute on dask DataFrame with divisions (no shuffling)
+    result = ddf.groupby(['idx', 'a']).apply(func)
+    assert_eq(expected, result)


We might be able to test the no-shuffling assertion by checking the size of the task graph, len(result.dask). This should be much smaller in the efficient case, particularly if you shuffle under a with dask.set_options(shuffle='tasks') block

This can probably apply above as well

mrocklin · 2017-08-29T15:21:03Z

dask/dataframe/core.py

+        and/or the name of the DataFrame's index
+        :return: Dask DataFrame with columns corresponding to each column or
+        index level in columns_or_index.  If included, the column corresponding 
+        to the index level is named _index


Dask tends to use numpydoc style docstrings. http://dask.pydata.org/en/latest/develop.html#docstrings
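For reference, a numpydoc-style version of the docstring under review might look roughly like the sketch below. The function name `select_columns_or_index` and its list-based signature are hypothetical stand-ins chosen to keep the example self-contained. The method in the PR returns a Dask DataFrame, whereas this sketch only returns the resulting labels. Letting a column win over an index of the same name is also an assumption of the sketch.

```python
def select_columns_or_index(column_names, index_name, columns_or_index):
    """Resolve grouping keys against a frame's columns and index name.

    Hypothetical helper: it illustrates numpydoc section layout only.

    Parameters
    ----------
    column_names : list
        Column labels of the frame.
    index_name : object or None
        Name of the frame's index, or None if the index is unnamed.
    columns_or_index : list
        Keys to select. Each key is a column label and/or the index name.

    Returns
    -------
    list
        One label per selected key. The key that refers to the index
        level is reported as ``'_index'``.

    Raises
    ------
    KeyError
        If a key matches neither a column nor the index name.
    """
    selected = []
    for key in columns_or_index:
        if key in column_names:
            # Assumption: a column takes precedence over a same-named index.
            selected.append(key)
        elif index_name is not None and key == index_name:
            selected.append('_index')
        else:
            raise KeyError(key)
    return selected


labels = select_columns_or_index(['a', 'b'], 'idx', ['idx', 'a'])
print(labels)  # ['_index', 'a']
```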

jonmmease · 2017-08-29T15:37:13Z

Well, looks like I managed to break everything :-) Looking into it. Some (but not all) of these failures are because these changes won't work with versions of pandas before 0.20. What's the best way to inform the test suite (and users) of this dependency?

mrocklin · 2017-08-29T15:40:18Z

I guess we first have to ask ourselves "do we want to continue supporting Pandas 0.19?"

If the answer is yes then presumably things would still work in either case, but the checks about efficiency would only be run if the pandas version met some criterion. I think that there are dask.dataframe tests that use LooseVersion for checks, although given the release cycle I suspect that a straight check on the value of pd.__version__ would also work fine.

jcrist · 2017-08-29T15:55:34Z

> I guess we first have to ask ourselves "do we want to continue supporting Pandas 0.19?"

0.19.0 was released in October of 2016. I'm not sure if we can drop it yet, would probably need to poll users. No strong opinions here, except that if we do drop 0.19.* it should be a major point release.

> although given the release cycle I suspect that a straight check on the value of pd.__version__ would also work fine.

You can import PANDAS_VERSION from dask.dataframe.utils. For consistency, would prefer you use that. There are several places where we check for PANDAS_VERSION >= 0.20.0
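As a rough illustration of the version gate discussed above, the sketch below compares a version string against a minimum using only the standard library. It is a stand-in for the `LooseVersion` or `PANDAS_VERSION` checks mentioned in the thread, not a copy of them. The names `version_tuple`, `supports_mixed_groupby`, and `MIN_PANDAS_FOR_MIXED_GROUPBY` are hypothetical.

```python
import re

# Assumption for this sketch: the feature needs pandas 0.20.0 or newer.
MIN_PANDAS_FOR_MIXED_GROUPBY = (0, 20, 0)


def version_tuple(version):
    """Return the leading numeric components of a version string.

    Parsing stops at the first component that does not start with a digit,
    and the result is padded with zeros to at least three components.
    """
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_mixed_groupby(pandas_version):
    """True if the given pandas version string meets the minimum."""
    return version_tuple(pandas_version) >= MIN_PANDAS_FOR_MIXED_GROUPBY


for candidate in ['0.19.2', '0.20.3', '0.21.0rc1']:
    print(candidate, supports_mixed_groupby(candidate))
```

In a test suite, a predicate like this could feed a `pytest.mark.skipif` condition so that the efficiency checks only run on new enough pandas.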

TomAugspurger

Only partway through, I'll take another look later, but overall this is looking good. Thanks!

TomAugspurger · 2017-08-29T16:36:41Z

dask/dataframe/core.py

+        """
+        # Ensure columns_or_index is a list
+        columns_or_index = (columns_or_index
+                            if isinstance(columns_or_index, list)


Maybe rewrite this to check if columns_or_index is a sequence, or maybe more easily reverse the logic and check for isinstance(columns_or_index, pd.compat.string_types)

jonmmease · 2017-08-30

Working on some updates now. I did the check this way because strings, integers, and tuples can all be valid column keys for pandas and I'm aiming to end up with a List of column keys. Does that make sense?
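The reasoning in this exchange can be sketched as follows: checking for `list` (rather than for "any sequence") keeps tuple keys intact, because a tuple is itself a valid single column key. The helper name `ensure_list` is hypothetical and only mirrors the conditional expression in the diff.

```python
def ensure_list(columns_or_index):
    """Wrap a single grouping key in a list; pass lists through unchanged."""
    if isinstance(columns_or_index, list):
        return columns_or_index
    # Strings, integers, and tuples are each treated as ONE key.
    return [columns_or_index]


print(ensure_list('a'))           # ['a']
print(ensure_list(0))             # [0]
print(ensure_list(('a', 'b')))    # [('a', 'b')]
print(ensure_list(['a', 'idx']))  # ['a', 'idx']
```

A generic sequence check would instead split `('a', 'b')` into two keys, which is presumably not what a caller grouping on a tuple-labelled column intends.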

TomAugspurger · 2017-08-29T16:37:36Z

dask/dataframe/core.py

+                            if isinstance(columns_or_index, list)
+                            else [columns_or_index])
+
+        column_names = [n for n in columns_or_index


Does self.columns & columns_or_index achieve the desired result? I'm wondering about cases where n is not a scalar, and why those should be excluded from column_names.

jonmmease · 2017-08-30

Yeah, good catch. The scalar check was there to filter out cases where you're grouping on something like a series. I'll extract the scalar check into a method that handles tuples and is a bit clearer.

TomAugspurger · 2017-08-29T16:41:26Z

dask/dataframe/core.py

+
+    def _contains_index_name(self, columns_or_index):
+        if isinstance(columns_or_index, list):
+            return (self.index.name


Can you check self.index.name is not None, otherwise we exclude falsey index names False or 0.

TomAugspurger · 2017-08-29T16:42:32Z

dask/dataframe/core.py

+                            for n in columns_or_index
+                            if np.isscalar(n)))
+        else:
+            return (columns_or_index


Same thing about is not None
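A minimal sketch of the `is not None` check requested in these two comments is shown below. The free function `contains_index_name` and its arguments are hypothetical. The method in the PR reads `self.index.name` instead, and its exact body is not reproduced here.

```python
def contains_index_name(index_name, columns_or_index):
    """Report whether the grouping keys refer to the index by name.

    An explicit ``is not None`` test is used so that falsey but valid
    index names such as ``0`` are still recognised. A plain truthiness
    test (``if index_name:``) would reject them.
    """
    if index_name is None:
        return False
    if isinstance(columns_or_index, list):
        return any(key == index_name for key in columns_or_index)
    return columns_or_index == index_name


print(contains_index_name(0, [0, 'a']))        # True
print(contains_index_name('idx', 'idx'))       # True
print(contains_index_name(None, [None, 'a']))  # False
```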

TomAugspurger · 2017-08-29T21:52:19Z

dask/dataframe/tests/test_groupby.py

+    # Test aggregate strings
+    if agg_func in {'sum', 'mean', 'var', 'size', 'std', 'count'}:
+        result = ddf_no_divs.groupby(['a', 'idx']).agg(agg_func)
+        assert_eq(expected, result)


Could you add a case that just groups by the index, and no columns?

jonmmease · 2017-08-30T14:56:18Z

Ok, just pushed changes that I believe address the review comments thus far. The new tests are all working with pandas' master (I forgot that's what I was using when I was testing the changes yesterday).

Unfortunately, I'm hitting up against a bug in pandas 0.20.3 (pandas-dev/pandas#16843) that breaks metadata calculations in a lot of cases when grouping on combinations of columns and the index. This bug was fixed last week and will be in 0.21 though I'm not familiar with the pandas release schedule timing.

To be clear, these changes don't (shouldn't!) break any existing tests with pandas 0.19 or 0.20. They just aren't going to be fully useful before pandas 0.21 lands.

Let me know how you'd like to proceed. Thanks!

TomAugspurger · 2017-08-30T16:10:17Z

> This bug was fixed last week and will be in 0.21 though I'm not familiar with the pandas release schedule timing

0.21 will be released at the end of September.

It looks like the last commit had some formatting issues: https://travis-ci.org/dask/dask/jobs/270025325#L4608

jonmmease · 2017-08-30T17:37:41Z

@mrocklin @TomAugspurger Looks like tests passed with the exception of the "PYTHON=3.5 NUMPY=1.12.1 PANDAS=0.19.2" configuration.

Wading through the log, the failures seem to be caused by exceptions stating "Exception: Tried sending message after closing. Status: closed". Next to the first failure in the summary I see:

[gw1] node down: Not properly terminated
[gw1] FAILED dask/dataframe/tests/test_multi.py::test_merge_by_index_patterns[left-disk] 
Replacing crashed slave gw1

Do you think this is something related to my changes or did something flakey happen on the CI servers?

TomAugspurger · 2017-08-30T18:59:39Z

Hmm, one thing I just thought of. The new np.isscalar calls may trigger some unnecessary computations when the grouper is a dask.Series.

In [20]: import numpy as np; import pandas as pd; import dask.dataframe as dd

In [21]: df = pd.DataFrame({"A": [1] * 5 + [2] * 5, "B": ['a', 'b'] * 5, 'C': range(10)}, index=pd.Index(range(10), name='A'))

In [22]: a = dd.from_pandas(df, 2).set_index("A")

In [23]: a.groupby(a.B).C.apply(np.mean)  # will trigger an `np.isscalar(a.B)`, computing a.B

I don't know how big of a problem this is, but I think it could be avoided. I'll take another look later.

jonmmease · 2017-08-30T20:04:38Z

Oh, good point. I see what you mean. I could always check for dask collections before calling np.isscalar(). Let me know if you think of anything more elegant.

TomAugspurger · 2017-08-30T20:24:13Z

I think that sounds reasonable.
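The guard agreed on here can be sketched without dask or NumPy. `LazySeries` is a hypothetical stand-in for a dask collection, and `expensive_is_scalar` is a stand-in for a scalar test that would materialize lazy input, as described above for `np.isscalar`. Neither is real dask or NumPy API. In real code the type check might be an `isinstance` test against the dask collection classes. That is an assumption, not something the thread specifies.

```python
import numbers


class LazySeries:
    """Stand-in for a lazy collection; records whether it was computed."""

    def __init__(self, values):
        self._values = values
        self.computed = False

    def compute(self):
        self.computed = True
        return list(self._values)


def expensive_is_scalar(obj):
    """Stand-in scalar test that materializes lazy inputs as a side effect."""
    if isinstance(obj, LazySeries):
        obj.compute()
        return False
    return isinstance(obj, (str, bytes, numbers.Number))


def is_column_label(key):
    """Decide whether a grouping key is a label, without computing anything."""
    # Check for lazy collections FIRST so the scalar test never sees them.
    if isinstance(key, LazySeries):
        return False
    # Tuples can be valid column keys, so treat them as labels.
    if isinstance(key, tuple):
        return True
    return expensive_is_scalar(key)


grouper = LazySeries([1, 2, 3])
keys = ['a', ('x', 'y'), grouper]
labels = [key for key in keys if is_column_label(key)]
print(labels)            # ['a', ('x', 'y')]
print(grouper.computed)  # False
```

The ordering of the checks is the whole point: the cheap type test short-circuits before anything that could force evaluation.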


TomAugspurger · 2017-08-31T14:11:13Z

Thanks @jonmmease

jonmmease · 2017-08-31T15:16:30Z

Sure thing. Thanks for the quick feedback @TomAugspurger and @mrocklin

mrocklin · 2017-08-31T16:28:37Z

It's great to see more hands active in dask.dataframe :)


Allow df.groupby to accept the name of the index along with columns

5fc7d9d

mrocklin reviewed Aug 29, 2017

TomAugspurger reviewed Aug 29, 2017

TomAugspurger approved these changes Aug 29, 2017

General review cleanup

518253f

PEP8 Updates

0f6edbb

Avoid evaluating dask collections with np.isscalar when checking labels

16f910f

TomAugspurger merged commit d32e8b7 into dask:master Aug 31, 2017

TomAugspurger mentioned this pull request Aug 31, 2017

ENH: Allow the groupby by param to handle columns and index levels #2635

Closed

pp-mo mentioned this pull request Sep 4, 2017

Bump version to 0.15.3 #2654

Closed

TomAugspurger mentioned this pull request Nov 14, 2017

DataFrame.groupby level param #2887

Closed
