BUG: first/last lose timezone in groupby with as_index=False #21573

reidy-p · 2018-06-21T12:39:21Z

closes BUG: first() loses the timezone in groupby #15884
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

reidy-p · 2018-06-21T12:43:21Z

pandas/core/groupby/groupby.py

@@ -4740,7 +4740,10 @@ def _wrap_transformed_output(self, output, names=None):

    def _wrap_agged_blocks(self, items, blocks):
        if not self.as_index:
-            index = np.arange(blocks[0].values.shape[1])
+            if blocks[0].values.ndim > 1:


In a case such as:

pd.DataFrame({'time': [pd.Timestamp('2012-01-01 13:00:00+00:00')], 'A': [3]}).groupby('A', as_index=False).first()

blocks[0].values is a DatetimeIndex and not a 2-D array, so calling shape[1] on it raises an index out of range error. The compat routine for first or last is then called instead, which leads to the timezone being lost.
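
The failure described here can be reproduced without touching the groupby internals. The following is a minimal sketch, not the pandas code itself: `dense_values`, `tz_values` and `row_count_via_shape_1` are names made up for illustration, and the 2-D NumPy array merely stands in for the values of an ordinary block.

```python
import numpy as np
import pandas as pd

# Stand-in for an ordinary block's values: a 2-D array.
dense_values = np.array([[3]])

# tz-aware values as described above: a 1-D DatetimeIndex.
tz_values = pd.DatetimeIndex([pd.Timestamp('2012-01-01 13:00:00+00:00')])


def row_count_via_shape_1(values):
    """Hypothetical helper mirroring the old ``values.shape[1]`` lookup."""
    return values.shape[1]


# Works: a 2-D array has a second axis.
print(row_count_via_shape_1(dense_values))

# Fails: a DatetimeIndex has only one axis, so shape[1] does not exist.
failure = None
try:
    row_count_via_shape_1(tz_values)
except IndexError as err:
    failure = err
    print("IndexError:", err)
```

The `IndexError` comes from indexing the one-element shape tuple, which is the "index out of range" error mentioned in the comment.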

the compat routine for first or last are called which leads to the timezone being lost.

Is there a case where DatetimeTZ data can legitimately go through the compat routine (maybe with as_index=True)? It would be great for the data type to be preserved there as well.

Yeah I have also been wondering about the same thing. The following case goes through the compat routine and consequently loses the timezone:

In [2]: df = pd.DataFrame({'group': [1, 1, 2],
   ...:                    'category_string': pd.Series(list('abc')).astype('category'),
   ...:                    'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})

In [3]: df.groupby('group').first()
Out[3]:
      category_string          datetimetz
group
1                   a 2013-01-01 05:00:00
2                   c 2013-01-03 05:00:00

But if we exclude the categorical column it doesn't go through the compat routine and preserves the timezone information:

In [4]: df[['group', 'datetimetz']].groupby('group').first()
Out[4]:
                     datetimetz
group
1     2013-01-01 00:00:00-05:00
2     2013-01-03 00:00:00-05:00

So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine? I think this might actually be quite straightforward (see #15885)

jschendel · 2018-06-21T16:53:20Z

doc/source/whatsnew/v0.24.0.txt

@@ -225,7 +225,7 @@ Plotting
 Groupby/Resample/Rolling
 ^^^^^^^^^^^^^^^^^^^^^^^^

-
+- Bug in :func:`pandas.core.groupby.first` and :func:`pandas.core.groupby.last` with ``as_index=False`` leading to the loss of timezone information (:issue:`15884`)


I think the :func: links should be pandas.core.groupby.GroupBy.first (and likewise for last)

Yep, you're right, thanks!

codecov · 2018-06-21T19:17:53Z

Codecov Report

Merging #21573 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #21573   +/-   ##
=======================================
  Coverage    91.9%    91.9%           
=======================================
  Files         153      153           
  Lines       49549    49549           
=======================================
  Hits        45539    45539           
  Misses       4010     4010

Flag	Coverage Δ
#multiple	`90.3% <100%> (ø)`	⬆️
#single	`41.78% <0%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`92.66% <100%> (ø)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 028c9c0...ef5d5b1.

jreback · 2018-06-22T10:32:16Z

pandas/core/groupby/groupby.py

@@ -4740,7 +4740,7 @@ def _wrap_transformed_output(self, output, names=None):

    def _wrap_agged_blocks(self, items, blocks):
        if not self.as_index:
-            index = np.arange(blocks[0].values.shape[1])
+            index = np.arange(blocks[0].values.shape[-1])
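
Why `shape[-1]` is enough can be seen in isolation. This is a minimal sketch under the assumption that the number of result rows is the length of the last axis in both layouts; `positional_index` is a hypothetical stand-in for the patched line, and the sample arrays are made up for illustration.

```python
import numpy as np
import pandas as pd


def positional_index(values):
    """Hypothetical stand-in for the patched line: 0..n-1 from the last axis."""
    return np.arange(values.shape[-1])


# 2-D values: the last axis is axis 1, so this matches the old shape[1].
dense_values = np.zeros((2, 3))

# 1-D tz-aware values: the last axis is axis 0, the only axis there is.
tz_values = pd.date_range('2013-01-01', periods=3, tz='UTC')

print(positional_index(dense_values))
print(positional_index(tz_values))
```

Indexing the shape tuple from the end avoids the explicit `ndim` branch from the earlier revision of this hunk.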


@reidy-p I pushed a simplification, but this may need some additional tests that exercise it when a column is selected

e.g. df.groupby('id', as_index=False)['foo'].first()

Nice simplification. I added some new tests.
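
A test for the column-selected case could look roughly like the sketch below. It is not the test that was added in this PR: the frame, the column names `id` and `foo`, and the helper `check_tz_preserved` are illustrative assumptions, and it presumes a pandas version that already contains this fix.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'foo': pd.date_range('2013-01-01', periods=3, tz='UTC'),
})


def check_tz_preserved(method):
    """Hypothetical test helper; ``method`` is 'first' or 'last'."""
    grouped = df.groupby('id', as_index=False)['foo']
    result = getattr(grouped, method)()
    dtype = result['foo'].dtype
    # The aggregated column should still be tz-aware, with the same zone.
    assert isinstance(dtype, pd.DatetimeTZDtype), dtype
    assert str(dtype.tz) == 'UTC', dtype
    return result


first = check_tz_preserved('first')
last = check_tz_preserved('last')
```

Checking the dtype rather than a printed representation keeps the assertion independent of display formatting.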

mroeschke · 2018-06-22T17:02:09Z

@reidy-p

So if we have the categorical column do we want to legitimately go through the compat routine? And, if so, should we preserve the timezone in the compat routine?

I am not too familiar with the conditions under which the data gets passed through the compat routine, but timezones should be preserved in the compat routine. Yes, it looks like #15885 should solve that issue (and offer a performance boost for Categoricals as well, xref #19026)

jreback · 2018-06-22T23:01:46Z

thanks @reidy-p very nice!

reidy-p changed the title ~~BUG: first/last lose timezone in groupby~~ BUG: first/last lose timezone in groupby with as_index=False Jun 21, 2018

jschendel reviewed Jun 21, 2018

mroeschke added the Bug and Timezones (Timezone data dtype) labels Jun 21, 2018

jreback requested changes Jun 22, 2018

jreback added this to the 0.24.0 milestone Jun 22, 2018

reidy-p and others added 2 commits June 22, 2018 15:20

BUG: first/last lose timezone in groupby

4d7e1cf

simplify

f46ce84

reidy-p force-pushed the groupby_tz branch 2 times, most recently from 7bb7a3b to 26ed691 Compare June 22, 2018 15:55

Fix whatsnew and add tests

7408aab

reidy-p force-pushed the groupby_tz branch from 26ed691 to 7408aab Compare June 22, 2018 15:57

lint

ef5d5b1

jreback approved these changes Jun 22, 2018

jreback merged commit c6347c4 into pandas-dev:master Jun 22, 2018

mroeschke mentioned this pull request Jun 23, 2018

BUG: groupby.first/last loses timezone information followup #21603

Closed

reidy-p deleted the groupby_tz branch June 23, 2018 22:13

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: first/last lose timezone in groupby with as_index=False (pandas-dev#21573)

594b75e
