-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: first/last lose timezone in groupby with as_index=False #21573
Conversation
pandas/core/groupby/groupby.py
Outdated
@@ -4740,7 +4740,10 @@ def _wrap_transformed_output(self, output, names=None): | |||
|
|||
def _wrap_agged_blocks(self, items, blocks): | |||
if not self.as_index: | |||
index = np.arange(blocks[0].values.shape[1]) | |||
if blocks[0].values.ndim > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a case such as:
pd.DataFrame({'time': [pd.Timestamp('2012-01-01 13:00:00+00:00')],
'A': [3]}).groupby('A', as_index=False).first()
the blocks[0].values
is a DatetimeIndex
and not an array
so trying to call shape[1]
on a DTI
results in an index out of range error and then the compat routine for first
or last
are called which leads to the timezone being lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the compat routine for first or last are called which leads to the timezone being lost.
Is there a case where DatetimeTZ data can legitimately go through the compat routine (maybe with as_index=True
?) It would be great for the data type to also be preserved there as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I have also been wondering about the same thing. The following case goes through the compat routine and consequently loses the timezone:
In [2]: df = pd.DataFrame({'group': [1, 1, 2],
'category_string': pd.Series(list('abc')).astype('category'),
'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern'})
In [3]: df.groupby('group').first()
Out[3]:
category_string datetimetz
group
1 a 2013-01-01 05:00:00
2 c 2013-01-03 05:00:00
But if we exclude the categorical column it doesn't go through the compat routine and preserves the timezone information:
In[4]: df[['group', 'datetimetz']].groupby('group').first()
Out[4]:
datetimetz
group
1 2013-01-01 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine? I think this might actually be quite straightforward (see #15885)
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -225,7 +225,7 @@ Plotting | |||
Groupby/Resample/Rolling | |||
^^^^^^^^^^^^^^^^^^^^^^^^ | |||
|
|||
- | |||
- Bug in :func:`pandas.core.groupby.first` and :func:`pandas.core.groupby.last` with ``as_index=False`` leading to the loss of timezone information (:issue:`15884`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the :func:
links should be pandas.core.groupby.GroupBy.first
(and likewise for last
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, you're right, thanks!
Codecov Report
@@ Coverage Diff @@
## master #21573 +/- ##
=======================================
Coverage 91.9% 91.9%
=======================================
Files 153 153
Lines 49549 49549
=======================================
Hits 45539 45539
Misses 4010 4010
Continue to review full report at Codecov.
|
@@ -4740,7 +4740,7 @@ def _wrap_transformed_output(self, output, names=None): | |||
|
|||
def _wrap_agged_blocks(self, items, blocks): | |||
if not self.as_index: | |||
index = np.arange(blocks[0].values.shape[1]) | |||
index = np.arange(blocks[0].values.shape[-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reidy-p pushed a simplification. but maybe need some additional tests that do this when a column is selected
e.g. df.groupby('id', as_index=False)['foo'].first()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice simplification. I added some new tests.
7bb7a3b
to
26ed691
Compare
I am not too familiar of the conditions in which the data gets passed through the compat routine, but timezones should be preserved in the compat routine. Yes, looks like #15885 should solve that issue (and offer a performance boost for Categoriacals as well xref #19026) |
thanks @reidy-p very nice! |
git diff upstream/master -u -- "*.py" | flake8 --diff