BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column #20959

jreback · 2018-05-05T14:29:27Z

closes #20949

jreback · 2018-05-05T14:32:33Z

So this fixes the immediate issue and provides a more general solution: a context manager to set/reset the group_selection.

I removed most of the current usages, which didn't break anything. There are a few more calls to _set_group_selection that should use the context manager, because they are not paired with _reset_group_selection calls and hence change state on the groupby objects.
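To illustrate the pairing problem described above, here is a minimal sketch of the set/reset pattern wrapped in a context manager. `ToyGroupBy`, `selection_context`, and `working_columns` are hypothetical names made up for this illustration; this is not the pandas implementation, only the general shape of the idea.

```python
from contextlib import contextmanager


class ToyGroupBy:
    """Hypothetical stand-in for a groupby object; not the pandas internals."""

    def __init__(self, columns, key):
        self.columns = list(columns)
        self.key = key
        self._selection = None  # None means "operate on every column"

    def _set_selection(self):
        # Exclude the grouping column from the working set.
        self._selection = [c for c in self.columns if c != self.key]

    def _reset_selection(self):
        self._selection = None

    @contextmanager
    def selection_context(self):
        # Pair set/reset so the object's state is restored even on error.
        self._set_selection()
        try:
            yield self
        finally:
            self._reset_selection()

    def working_columns(self):
        return self.columns if self._selection is None else self._selection


g = ToyGroupBy(["A", "B", "C"], key="A")
with g.selection_context():
    inside = g.working_columns()  # grouping column excluded here
after = g.working_columns()  # state restored once the block exits
```

Because the reset lives in a `finally` clause, an exception raised inside the `with` block cannot leave the selection set on the object.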

These actually should be fixed, but the tests have to be changed for the return result. I don't think this is actually an API change, rather some bugs in how we are using .apply.

This boils down to whether apply is a filtering operation or not, but you don't know this a priori for a UDF. For built-in functions we DO know. This comes up when you have an operation that fails with mixed dtypes (e.g. the grouper is a different type than the columns and the apply function cannot handle this).
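A small sketch of the mixed-dtype failure mode, using explicit iteration over the groups rather than `.apply` so that it does not depend on how a given pandas version treats the grouping column. `demean` and `apply_with_fallback` are hypothetical helpers written for this example; the retry-without-the-key strategy is one possible approach, not necessarily what this PR does internally.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b"], "B": [1, 2, 3], "C": [4, 6, 5]})


def demean(frame):
    # Hypothetical UDF that only makes sense for numeric columns.
    values = frame.astype("float64")
    return values - values.mean()


def apply_with_fallback(frame, key, func):
    """Apply func to each group; retry without the key column on failure."""
    pieces = {}
    retried = []
    for name, group in frame.groupby(key):
        # Iterating a groupby yields sub-frames that still contain `key`.
        try:
            pieces[name] = func(group)
        except (TypeError, ValueError):
            retried.append(name)
            pieces[name] = func(group.drop(columns=key))
    return pieces, retried


pieces, retried = apply_with_fallback(df, "A", demean)
```

Here the string grouping column cannot be cast to float, so every group takes the fallback path and the results contain only the numeric columns.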


WillAyd · 2018-05-05T17:50:13Z

Very nice - I think using the context manager is a great way to handle this. Also agreed this isn't an API change (can't believe this has been around as long as it has - I noticed it back in 0.19.x).

FWIW I'm trying to think through when the Grouping column really should be included as part of the anonymous function. I would expect the following to be equivalent:

In [8]: df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
   ...: g = df.groupby('A')
   ...: g.shift()
   ...: 
Out[8]: 
     B    C
0  NaN  NaN
1  1.0  4.0
2  NaN  NaN

In [9]: df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
   ...: g = df.groupby('A')
   ...: g.apply(lambda x: x.shift())
   ...: 
Out[9]: 
     A    B    C
0  NaN  NaN  NaN
1    a  1.0  4.0
2  NaN  NaN  NaN

Same with these:

In [18]: df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
    ...: g = df.groupby('A')
    ...: g.apply(lambda x: x.sum())
    ...: 
Out[18]: 
    A  B   C
A           
a  aa  3  10
b   b  3   5

In [19]: df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
    ...: g = df.groupby('A')
    ...: g.sum()
    ...: 
Out[19]: 
   B   C
A       
a  3  10
b  3   5

Though perhaps there are legitimate cases where the anonymous function should be applied to the grouping column as well (?). We don't need to solve this here, but I'm bringing it up as food for thought.
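One way to get the equivalence expected in the comparisons above, regardless of whether `.apply` passes the grouping column to the function, is to select the value columns explicitly. This is a sketch under that assumption; `group_keys=False` is used so the `shift` result keeps the original row index.

```python
import pandas as pd

df = pd.DataFrame({"A": "a a b".split(), "B": [1, 2, 3], "C": [4, 6, 5]})

# Select the non-grouping columns up front so apply never sees column A.
g = df.groupby("A", group_keys=False)[["B", "C"]]

shift_builtin = g.shift()
shift_applied = g.apply(lambda x: x.shift())

sum_builtin = g.sum()
sum_applied = g.apply(lambda x: x.sum())

# Both pairs are expected to match once the grouping column is excluded.
pd.testing.assert_frame_equal(shift_builtin, shift_applied)
pd.testing.assert_frame_equal(sum_builtin, sum_applied)
```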

Dr-Irv · 2018-05-06T00:13:06Z

@jreback FYI, I found this while researching issues on .get() (#20885), in tests/groupby/aggregate/test_other.py in test_agg_timezone_round_trip, where the apply method is called as the last of a sequence. If you move the apply to the first in the sequence, that bugs out.
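The order dependence reported here can be checked with a small probe that records which columns the applied function receives, once on a fresh groupby object and once after another operation has already run on it. `columns_seen_by_apply` is a hypothetical helper written for this sketch; it is not taken from the pandas test suite.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b"], "B": [1, 2, 3], "C": [4, 6, 5]})


def columns_seen_by_apply(grouped):
    """Return the set of column layouts passed to the applied function."""
    seen = []

    def record(frame):
        seen.append(tuple(frame.columns))
        return frame[["B", "C"]].sum()

    grouped.apply(record)
    return set(seen)


# apply as the first operation on a fresh groupby object
fresh = columns_seen_by_apply(df.groupby("A"))

# apply after another operation has already been called on the same object
g = df.groupby("A")
g.sum()
used = columns_seen_by_apply(g)

# If no state leaks between calls, both orderings see the same columns.
assert fresh == used
```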

codecov · 2018-05-07T10:47:41Z

Codecov Report

Merging #20959 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20959      +/-   ##
==========================================
+ Coverage   91.81%   91.82%   +<.01%     
==========================================
  Files         153      153              
  Lines       49481    49491      +10     
==========================================
+ Hits        45430    45443      +13     
+ Misses       4051     4048       -3

Flag         Coverage Δ
#multiple    90.21% <100%> (ø) ⬆️
#single      41.84% <5.55%> (-0.01%) ⬇️

Impacted Files                    Coverage Δ
pandas/core/groupby/groupby.py    92.66% <100%> (+0.08%) ⬆️
pandas/util/testing.py            84.59% <0%> (+0.2%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bd4332f...a67f4d0.

jreback added Bug Groupby labels May 5, 2018

jreback added this to the 0.23.0 milestone May 5, 2018

jreback mentioned this pull request May 5, 2018

Using apply on a grouper works only if done after another operation on grouper #20958

Closed

jreback added 2 commits May 5, 2018 10:48

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column closes pandas-dev#20949

41d930b

use a context manager

881c7e1

typo

a67f4d0

jreback force-pushed the gapply branch from 4a0ca5e to a67f4d0 Compare May 7, 2018 10:04

jreback merged commit 620784f into pandas-dev:master May 8, 2018
