-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column #20959
Conversation
so this fixes the immediate issue, and provides a more generation soln, a context manager to set/reset the group_selection. I removed most of the current usages which didn't break anything. There are a few more (calls to These actually should be fixed, but the tests have to be changed for the return result. I don't think this is actually an API change, rather some bugs in how we are using This boils down to whether apply is a filtering operation or not, but you don't know this a-priori for a udf. But for built in functions we DO know. This comes up when you have a operation that fails with mixed dtypes (e.g. the grouper is a different type than the columns and the apply function cannot handle this). |
…pes and the user supplied function can fail on the grouping column closes pandas-dev#20949
Very nice - I think using the context manager is a great way to handle this. Also agreed this isn't an API change (can't believe this has been around as long as it has - I noticed it back in 0.19.x). FWIW I'm trying to think through when the Grouping column really should be included as part of the anonymous function. I would expect the following to be equivalent: In [8]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
...: >>> g = df.groupby('A')
...: >>> g.shift()
...:
Out[8]:
B C
0 NaN NaN
1 1.0 4.0
2 NaN NaN
In [9]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
...: >>> g = df.groupby('A')
...: >>> g.apply(lambda x: x.shift())
...:
Out[9]:
A B C
0 NaN NaN NaN
1 a 1.0 4.0
2 NaN NaN NaN Same with these: In [18]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
...: >>> g = df.groupby('A')
...: >>> g.apply(lambda x: x.sum())
...:
Out[18]:
A B C
A
a aa 3 10
b b 3 5
In [19]: >>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
...: >>> g = df.groupby('A')
...: >>> g.sum()
...:
Out[19]:
B C
A
a 3 10
b 3 5 Though perhaps there are legitimate cases where the anonymous function should be applied to the grouping column as well (?). Don't need to solve here but bringing up as food for thought |
Codecov Report
@@ Coverage Diff @@
## master #20959 +/- ##
==========================================
+ Coverage 91.81% 91.82% +<.01%
==========================================
Files 153 153
Lines 49481 49491 +10
==========================================
+ Hits 45430 45443 +13
+ Misses 4051 4048 -3
Continue to review full report at Codecov.
|
closes #20949