Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column #20959

Merged
merged 3 commits into from
May 8, 2018

Conversation

jreback
Copy link
Contributor

@jreback jreback commented May 5, 2018

closes #20949

@jreback jreback added this to the 0.23.0 milestone May 5, 2018
@jreback
Copy link
Contributor Author

jreback commented May 5, 2018

cc @WillAyd @Dr-Irv

so this fixes the immediate issue, and provides a more generation soln, a context manager to set/reset the group_selection.

I removed most of the current usages which didn't break anything. There are a few more (calls to _set_group_selection that should use the context manager (because they are not paired with _reset_group_selection calls and hence are changing state on the groupby objects.

These actually should be fixed, but the tests have to be changed for the return result. I don't think this is actually an API change, rather some bugs in how we are using .apply.

This boils down to whether apply is a filtering operation or not, but you don't know this a-priori for a udf. But for built in functions we DO know. This comes up when you have a operation that fails with mixed dtypes (e.g. the grouper is a different type than the columns and the apply function cannot handle this).

…pes and the user supplied function can fail on the grouping column

closes pandas-dev#20949
@WillAyd
Copy link
Member

WillAyd commented May 5, 2018

Very nice - I think using the context manager is a great way to handle this. Also agreed this isn't an API change (can't believe this has been around as long as it has - I noticed it back in 0.19.x).

FWIW I'm trying to think through when the Grouping column really should be included as part of the anonymous function. I would expect the following to be equivalent:

In [8]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
   ...: >>> g = df.groupby('A')
   ...: >>> g.shift()
   ...: 
Out[8]: 
     B    C
0  NaN  NaN
1  1.0  4.0
2  NaN  NaN

In [9]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
   ...: >>> g = df.groupby('A')
   ...: >>> g.apply(lambda x: x.shift())
   ...: 
Out[9]: 
     A    B    C
0  NaN  NaN  NaN
1    a  1.0  4.0
2  NaN  NaN  NaN

Same with these:

In [18]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
    ...: >>> g = df.groupby('A')
    ...: >>> g.apply(lambda x: x.sum())
    ...: 
Out[18]: 
    A  B   C
A           
a  aa  3  10
b   b  3   5

In [19]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
    ...: >>> g = df.groupby('A')
    ...: >>> g.sum()
    ...: 
Out[19]: 
   B   C
A       
a  3  10
b  3   5

Though perhaps there are legitimate cases where the anonymous function should be applied to the grouping column as well (?). Don't need to solve here but bringing up as food for thought

@Dr-Irv
Copy link
Contributor

Dr-Irv commented May 6, 2018

@jreback FYI, I found this while researching issues on .get() (#20885), in tests/groupby/aggregate/test_other.py in test_agg_timezone_round_trip, where the apply method is called as the last of a sequence. If you move the apply to the first in the sequence, that bugs out.

@codecov
Copy link

codecov bot commented May 7, 2018

Codecov Report

Merging #20959 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20959      +/-   ##
==========================================
+ Coverage   91.81%   91.82%   +<.01%     
==========================================
  Files         153      153              
  Lines       49481    49491      +10     
==========================================
+ Hits        45430    45443      +13     
+ Misses       4051     4048       -3
Flag Coverage Δ
#multiple 90.21% <100%> (ø) ⬆️
#single 41.84% <5.55%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 92.66% <100%> (+0.08%) ⬆️
pandas/util/testing.py 84.59% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd4332f...a67f4d0. Read the comment docs.

@jreback jreback merged commit 620784f into pandas-dev:master May 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas.core.groupby.GroupBy.apply fails
3 participants