Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: first/last lose timezone in groupby with as_index=False #21573

Merged
merged 4 commits into from
Jun 22, 2018

Conversation

reidy-p
Copy link
Contributor

@reidy-p reidy-p commented Jun 21, 2018

@@ -4740,7 +4740,10 @@ def _wrap_transformed_output(self, output, names=None):

def _wrap_agged_blocks(self, items, blocks):
if not self.as_index:
index = np.arange(blocks[0].values.shape[1])
if blocks[0].values.ndim > 1:
Copy link
Contributor Author

@reidy-p reidy-p Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a case such as:

pd.DataFrame({'time': [pd.Timestamp('2012-01-01 13:00:00+00:00')],
              'A': [3]}).groupby('A', as_index=False).first()

the blocks[0].values is a DatetimeIndex and not an array so trying to call shape[1] on a DTI results in an index out of range error and then the compat routine for first or last are called which leads to the timezone being lost.

Copy link
Member

@mroeschke mroeschke Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the compat routine for first or last are called which leads to the timezone being lost.

Is there a case where DatetimeTZ data can legitimately go through the compat routine (maybe with as_index=True?) It would be great for the data type to also be preserved there as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I have also been wondering about the same thing. The following case goes through the compat routine and consequently loses the timezone:

In [2]: df = pd.DataFrame({'group': [1, 1, 2],
                           'category_string': pd.Series(list('abc')).astype('category'),
                           'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern'})
In [3]: df.groupby('group').first()
Out[3]: 
      category_string          datetimetz                                                
group                                                                                                                                                     
1                   a 2013-01-01 05:00:00                                                                      
2                   c 2013-01-03 05:00:00    

But if we exclude the categorical column it doesn't go through the compat routine and preserves the timezone information:

In[4]: df[['group', 'datetimetz']].groupby('group').first()
Out[4]:
                     datetimetz                                                       
group                                                                  
1     2013-01-01 00:00:00-05:00                                                            
2     2013-01-03 00:00:00-05:00

So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine? I think this might actually be quite straightforward (see #15885)

@reidy-p reidy-p changed the title BUG: first/last lose timezone in groupby BUG: first/last lose timezone in groupby with as_index=False Jun 21, 2018
@@ -225,7 +225,7 @@ Plotting
Groupby/Resample/Rolling
^^^^^^^^^^^^^^^^^^^^^^^^

-
- Bug in :func:`pandas.core.groupby.first` and :func:`pandas.core.groupby.last` with ``as_index=False`` leading to the loss of timezone information (:issue:`15884`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the :func: links should be pandas.core.groupby.GroupBy.first (and likewise for last)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you're right, thanks!

@mroeschke mroeschke added Bug Timezones Timezone data dtype labels Jun 21, 2018
@codecov
Copy link

codecov bot commented Jun 21, 2018

Codecov Report

Merging #21573 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21573   +/-   ##
=======================================
  Coverage    91.9%    91.9%           
=======================================
  Files         153      153           
  Lines       49549    49549           
=======================================
  Hits        45539    45539           
  Misses       4010     4010
Flag Coverage Δ
#multiple 90.3% <100%> (ø) ⬆️
#single 41.78% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 92.66% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 028c9c0...ef5d5b1. Read the comment docs.

@@ -4740,7 +4740,7 @@ def _wrap_transformed_output(self, output, names=None):

def _wrap_agged_blocks(self, items, blocks):
if not self.as_index:
index = np.arange(blocks[0].values.shape[1])
index = np.arange(blocks[0].values.shape[-1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reidy-p pushed a simplification. but maybe need some additional tests that do this when a column is selected

e.g. df.groupby('id', as_index=False)['foo'].first()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification. I added some new tests.

@jreback jreback added this to the 0.24.0 milestone Jun 22, 2018
@reidy-p reidy-p force-pushed the groupby_tz branch 2 times, most recently from 7bb7a3b to 26ed691 Compare June 22, 2018 15:55
@mroeschke
Copy link
Member

@reidy-p

So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine?

I am not too familiar of the conditions in which the data gets passed through the compat routine, but timezones should be preserved in the compat routine. Yes, looks like #15885 should solve that issue (and offer a performance boost for Categoriacals as well xref #19026)

@jreback jreback merged commit c6347c4 into pandas-dev:master Jun 22, 2018
@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

thanks @reidy-p very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: first() loses the timezone in groupby
4 participants