BUG: Applying function on column of Groupby object with as_index=False does not select column #5764

Closed
jorisvandenbossche opened this issue Dec 23, 2013 · 5 comments · Fixed by #30554
Labels: Apply (Apply, Aggregate, Transform, Map), good first issue, Groupby, Needs Tests (Unit test(s) needed to prevent regressions)

@jorisvandenbossche
Member

>>> df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
>>> df
   A  B
0  1  2
1  1  4
2  5  6
[3 rows x 2 columns]

Selecting a column of the GroupBy object still returns all columns:

>>> g = df.groupby('A', as_index=False)['B']
>>> g.get_group(1)
   A  B
0  1  2
1  1  4
[2 rows x 2 columns]
>>> g = df.groupby('A', as_index=False)
>>> g.get_group(1)
   A  B
0  1  2
1  1  4
[2 rows x 2 columns]
>>> g.get_group(1)['B']
0    2
1    4
Name: B, dtype: int64

So a function passed to apply is applied to all columns:

>>> df.groupby('A', as_index=False)['B'].apply(lambda x: x.cumsum())
   A  B
0  1  2
1  2  6
2  5  6
[3 rows x 2 columns]

With as_index=True it works as expected:

>>> g = df.groupby('A')
>>> g.get_group(1)
   A  B
0  1  2
1  1  4
[2 rows x 2 columns]

>>> g = df.groupby('A')['B']
>>> g.get_group(1)
0    2
1    4
Name: B, dtype: int64

>>> df.groupby('A')['B'].apply(lambda x: x.cumsum())
0    2
1    6
2    6
dtype: int64

A more elaborate example where this came up:

>>> s="""L1  L2  L3
... X   1   200
... X   2   100
... Z   1   15
... X   3   200
... Z   2   10
... Y   1   1
... Z   3   20
... Y   2   10
... Y   3   100"""
>>> from io import StringIO
>>> df = pd.read_csv(StringIO(s), sep=r"\s+")
>>> df.groupby("L1")["L3"].apply(lambda x: x.order().cumsum()/x.sum())
L1   
X   1    0.200000
    0    0.600000
    3    1.000000
Y   5    0.009009
    7    0.099099
    8    1.000000
Z   4    0.222222
    2    0.555556
    6    1.000000
dtype: float64

But if I don't want the X, Y, Z in the index:

>>> df.groupby("L1", as_index=False)["L3"].apply(lambda x: x.order().cumsum()/x.sum())

raises an error, because x is a DataFrame rather than a Series.
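For readers on a recent pandas, where `Series.order()` no longer exists, here is a minimal sketch of the same computation. It assumes `sort_values()` as the replacement for `order()`, and it uses `group_keys=False` instead of `as_index=False` to keep X, Y, Z out of the result index. Neither choice comes from the report above, and the name `share` is only illustrative.

```python
from io import StringIO

import pandas as pd

s = """L1  L2  L3
X   1   200
X   2   100
Z   1   15
X   3   200
Z   2   10
Y   1   1
Z   3   20
Y   2   10
Y   3   100"""

df = pd.read_csv(StringIO(s), sep=r"\s+")

# Assumption: sort_values() stands in for the old Series.order().
# group_keys=False asks apply not to prepend the L1 key to the index,
# so the result keeps the original row labels.
share = df.groupby("L1", group_keys=False)["L3"].apply(
    lambda x: x.sort_values().cumsum() / x.sum()
)

# Sort by row label only to make the printout easy to compare with df.
print(share.sort_index())
```

The result is a Series whose index is the original row labels, so it can be assigned straight back to the frame, for example with `df["share"] = share`.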

@jreback
Contributor

jreback commented Mar 22, 2014

On current master this looks OK after the recent changes by @hayd and @TomAugspurger.

@jorisvandenbossche, yes?

In [9]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [10]: g = df.groupby('A', as_index=False)['B']

In [11]: g.get_group(1)
Out[11]: 
0    2
1    4
Name: B, dtype: int64

In [12]: g = df.groupby('A', as_index=False)

In [13]: g.get_group(1)
Out[13]: 
   A  B
0  1  2
1  1  4

[2 rows x 2 columns]

In [14]: g.get_group(1)['B']
Out[14]: 
0    2
1    4
Name: B, dtype: int64

In [15]: df.groupby('A', as_index=False)['B'].apply(lambda x: x.cumsum())
Out[15]: 
0    2
1    6
2    6
dtype: int64

@jreback jreback added this to the 0.14.0 milestone Mar 22, 2014
@hayd
Contributor

hayd commented Mar 22, 2014

We should probably add some tests before closing. (I'm sure this came up recently on SO too.)
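A rough sketch of what such a test could check, using the frame from the report. The helper name `check_column_selection` is hypothetical, and this is not the test that was eventually merged. It only asserts that selecting `'B'` yields a Series named B for both values of `as_index`.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=["A", "B"])


def check_column_selection(as_index):
    """Hypothetical check: selecting 'B' should give only column B."""
    g = df.groupby("A", as_index=as_index)["B"]
    group = g.get_group(1)
    # The selection must be honoured regardless of as_index.
    assert isinstance(group, pd.Series)
    assert group.name == "B"
    assert group.tolist() == [2, 4]
    return group


# Run the same check for both settings of as_index.
for flag in (True, False):
    check_column_selection(flag)
```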

@jreback
Contributor

jreback commented Mar 22, 2014

yep

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 1, 2014
@jreback
Contributor

jreback commented May 1, 2014

This is now better in current master after #7000; [7] is going to be addressed in #5755.

In [3]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [4]: df
Out[4]: 
   A  B
0  1  2
1  1  4
2  5  6

[3 rows x 2 columns]

In [5]: g = df.groupby('A', as_index=False)['B']

In [6]: g.get_group(1)
Out[6]: 
0    2
1    4
Name: B, dtype: int64

In [7]: g = df.groupby('A', as_index=False)

In [8]: g.get_group(1)
Out[8]: 
   A  B
0  1  2
1  1  4

[2 rows x 2 columns]

In [9]: g.get_group(1)['B']
Out[9]: 
0    2
1    4
Name: B, dtype: int64

In [10]: df.groupby('A', as_index=False)['B'].apply(lambda x: x.cumsum())
Out[10]: 
0    2
1    6
2    6
dtype: int64

@jreback jreback modified the milestones: 0.15.0, 0.14.1 May 1, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@mroeschke mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the API Design and Bug labels Sep 29, 2019
@jbrockmendel jbrockmendel added the Apply (Apply, Aggregate, Transform, Map) label Oct 16, 2019
@jbrockmendel
Member

@jorisvandenbossche the behavior on master for get_group looks right to me now, but the cumsum result looks sketchy. Can you confirm?
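One way to sanity-check the cumsum values, sketched here as an assumption rather than anything proposed in this thread, is to compare the `apply` result against the built-in grouped `cumsum()`. The built-in returns a Series aligned to the original rows. As far as I know the index layout of the `apply` result has differed between pandas versions, so the sketch compares values only. The names `reference` and `applied` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=["A", "B"])

# Built-in grouped cumulative sum, aligned to the original row labels.
reference = df.groupby("A")["B"].cumsum()

# The call from the report; only its values are inspected here because
# the index it carries is assumed to be version dependent.
applied = df.groupby("A", as_index=False)["B"].apply(lambda x: x.cumsum())

print(reference.tolist())
print(np.asarray(applied).ravel().tolist())
```

If the two printed lists differ, the column selection is probably not being honoured by `apply`.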
