BUG: groupby.first/last loses timezone information followup #21603

mroeschke · 2018-06-23T02:28:46Z

In [2]: df = pd.DataFrame({'group': [1, 1, 2],
                           'category_string': pd.Series(list('abc')).astype('category'),
                           'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
In [3]: df.groupby('group').first()
Out[3]:
      category_string          datetimetz
group
1                   a 2013-01-01 05:00:00
2                   c 2013-01-03 05:00:00

The example above passes data through the first/last compat method, which strips timezone information. PR #15885 (now closed) should fix this issue (and offer a performance boost for Categorical data, as mentioned in #19026).
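For anyone puzzled by the 05:00:00 in the output above, here is a minimal standard-library sketch of what dropping the timezone from a UTC-backed timestamp looks like. It assumes a fixed UTC-05:00 offset as a stand-in for US/Eastern in January, and it only illustrates the symptom; it is not the pandas code path.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-05:00 offset stands in for US/Eastern in January.
eastern = timezone(timedelta(hours=-5))

aware = datetime(2013, 1, 1, 0, 0, tzinfo=eastern)

# Convert to UTC, then drop tzinfo: the wall-clock time shifts by five hours
# and the offset is gone.
stripped = aware.astimezone(timezone.utc).replace(tzinfo=None)

print(aware.isoformat())     # 2013-01-01T00:00:00-05:00
print(stripped.isoformat())  # 2013-01-01T05:00:00
```

The second value matches what first() printed for group 1, which suggests the compat path returns the underlying UTC value without its timezone.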

aggarwalvinayak · 2018-06-23T16:28:08Z

Hey, I would like to take up this issue. Also, I am new to open source.
Could you elaborate a bit more on what this issue is about?

WillAyd · 2018-06-23T16:37:13Z

@mroeschke haven't looked in detail yet but do you know if there's a reason why we even have a compat function here? Under the impression the Cythonized version should work here, no?

mroeschke · 2018-06-23T17:43:22Z

@WillAyd Just quickly looking at the code, it looks like a general first method was created and is invoked via apply for DataFrame groups and directly on Series groups. I'm not too familiar with the groupby routines, but wouldn't bringing this into Cython strip the current extension dtypes (i.e. Categorical -> Object and Datetimetz -> Datetime)?

pandas/pandas/core/groupby/groupby.py

Line 1436 in f1aa08c

def first_compat(x, axis=0):

@aggarwalvinayak the datetimetz column above has lost the timezone information that it had in the original DataFrame. The solution would be to implement the changes in #15885.

WillAyd · 2018-06-23T17:48:22Z

It may in its current form, but at least in theory I don’t think the first and last operations should have to behave differently based on the type being referenced. There are indexing operations for things like shift that can operate on any type arbitrarily, so it seems logical that the same could apply to accessing the first and last elements as well.
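To make the type-agnostic idea concrete, here is a rough pure-Python sketch: find the first and last row position of each group once, then take those positions from every column without looking at what the column holds. The helper name `first_last_positions` and the list-based columns are hypothetical and only for illustration; this is not how the Cython routines are written.

```python
from datetime import datetime, timedelta, timezone


def first_last_positions(keys):
    """Map each group key to the (first, last) row position where it occurs."""
    positions = {}
    for i, key in enumerate(keys):
        first, _ = positions.get(key, (i, i))
        positions[key] = (first, i)
    return positions


# Assumption: a fixed UTC-05:00 offset stands in for US/Eastern in January.
eastern = timezone(timedelta(hours=-5))
group = [1, 1, 2]
columns = {
    "category_string": ["a", "b", "c"],
    "datetimetz": [datetime(2013, 1, d, tzinfo=eastern) for d in (1, 2, 3)],
}

positions = first_last_positions(group)

# The same positions are applied to every column, so the values come back
# untouched whatever their type.
first_rows = {
    name: [values[first] for first, _ in positions.values()]
    for name, values in columns.items()
}
print(first_rows["datetimetz"][0].isoformat())  # 2013-01-01T00:00:00-05:00
```

Because only positions are computed, nothing is converted along the way and the offset survives.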

That’s a larger change I can look into and perhaps deserves a separate issue. @aggarwalvinayak feel free to continue diagnosing this as prescribed above.

jreback · 2018-06-23T17:52:56Z

this needs to be fixed in cython i think

aggarwalvinayak · 2018-06-23T17:55:11Z

@mroeschke

The solution would be to implement the changes in #15885

Do you mean reverting the changes that were introduced in #15885?
And @WillAyd, which diagnosing procedure are you referring to exactly?

mroeschke · 2018-06-23T19:17:22Z

Fair point @WillAyd, the indexing should be agnostic to the data types.

Alternatively, I was thinking: why isn't first/last just implemented in terms of the nth method? It looks to handle this issue correctly and to be more performant:

In [4]: df = pd.DataFrame({'group': [1, 1, 2],
   ...:                            'category_string': pd.Series(list('abc')).astype('category'),
   ...:                            'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
   ...:

In [5]: df
Out[5]:
   group category_string                datetimetz
0      1               a 2013-01-01 00:00:00-05:00
1      1               b 2013-01-02 00:00:00-05:00
2      2               c 2013-01-03 00:00:00-05:00

In [6]: df.groupby('group').first() # wrong
Out[6]:
      category_string          datetimetz
group
1                   a 2013-01-01 05:00:00
2                   c 2013-01-03 05:00:00

In [7]: df.groupby('group').nth(0) # correct
Out[7]:
      category_string                datetimetz
group
1                   a 2013-01-01 00:00:00-05:00
2                   c 2013-01-03 00:00:00-05:00

In [8]: %timeit df.groupby('group').nth(0)
2.52 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit df.groupby('group').first()
14.8 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

This would be similar to how head/tail is just a wrapper around iloc. This may be a win/win, unless first/last has a feature that nth doesn't have?
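A quick way to check this on a given install, without reading the reprs by eye, is to compare the resulting dtypes directly. This is only a sketch reusing the frame from above; no expected output is shown because the result of first() depends on the pandas version, and the shape of the nth(0) result may differ between versions too.

```python
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2],
    'category_string': pd.Series(list('abc')).astype('category'),
    'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern'),
})

first_result = df.groupby('group').first()
nth_result = df.groupby('group').nth(0)

# Compare dtypes rather than reprs: a tz-aware column reports a dtype that
# includes the timezone, a stripped one does not.
print(first_result['datetimetz'].dtype)
print(nth_result['datetimetz'].dtype)

# .dt.tz is None for a timezone-naive datetime column.
print(nth_result['datetimetz'].dt.tz)
```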

jreback · 2018-06-23T19:25:12Z

first/last I think could be implemented in terms of nth (which came later). NaN handling is the same, I think (that’s the defining issue and how they are different from head/tail).

WillAyd · 2018-06-23T19:26:55Z

If you take the compat out of the equation, first is really just nth passing 1 as n:

pandas/pandas/core/groupby/groupby.py

Line 2411 in f1aa08c

'first': {

last is a separate implementation in Cython. It's a little different than first because with first n will always be 1, but with last n could vary across each group. Perhaps an intelligent consolidation could still occur back in Cython. Here's the implementation of that (nth is in the same module):

pandas/pandas/_libs/groupby_helper.pxi.in

Line 300 in f1aa08c

def group_last_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,

mroeschke · 2018-06-23T20:12:29Z

Not sure if this is the right test, but it looks like nth supports negative indexing so shouldn't last always be -1?

In [3]: df.groupby('group').nth(-1)
Out[3]:
      category_string                datetimetz
group
1                   b 2013-01-02 00:00:00-05:00
2                   c 2013-01-03 00:00:00-05:00

@aggarwalvinayak what you may want to do then is start from @WillAyd's comments above and investigate whether it's possible to implement first/last in terms of the nth method.
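As a starting point for that investigation, here is a pure-Python sketch of the idea: one positional nth routine that accepts negative indices, with first and last as thin wrappers around it. The names `group_nth`, `group_first` and `group_last` are hypothetical, and the sketch deliberately ignores missing-value handling, which is the part flagged above as needing a closer look before the real methods could be consolidated.

```python
from collections import defaultdict


def group_nth(keys, values, n):
    """Return {key: n-th value of that group}; negative n counts from the end.

    Groups with too few rows for position n are omitted.
    """
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    result = {}
    for key, members in groups.items():
        if -len(members) <= n < len(members):
            result[key] = members[n]
    return result


def group_first(keys, values):
    # first is nth at position 0
    return group_nth(keys, values, 0)


def group_last(keys, values):
    # last is nth at position -1, regardless of each group's size
    return group_nth(keys, values, -1)


group = [1, 1, 2]
category_string = ["a", "b", "c"]
print(group_first(group, category_string))  # {1: 'a', 2: 'c'}
print(group_last(group, category_string))   # {1: 'b', 2: 'c'}
```

With negative indexing the position for last is the same constant for every group, so it no longer needs to vary with group size.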

WillAyd · 2018-06-24T03:13:34Z

@aggarwalvinayak welcome to still work on this but just as a heads up I removed the "good first issue" tag as this is a little more complicated touching on Cython code

aggarwalvinayak · 2018-06-24T17:04:54Z

@WillAyd I am not at all experienced with Cython. I will try to explore that first, because this is the second issue of mine from which the good first issue tag was removed because of Cython.

mroeschke added the Bug, Timezones (Timezone data dtype), and good first issue labels Jun 23, 2018

WillAyd removed the good first issue label Jun 24, 2018

mroeschke mentioned this issue Sep 30, 2018

BUG: Maintain column order with groupby.nth #22811

Merged

4 tasks

WillAyd mentioned this issue Dec 10, 2018

Lost timezone after groupby transform #24198

Closed

mroeschke mentioned this issue Jan 3, 2019

DEPR: __array__ for tz-aware Series/Index #24596

Merged

mroeschke mentioned this issue Feb 13, 2019

BUG: Groupby.agg with reduction function with tz aware data #25308

Merged

5 tasks

jreback added this to the 0.25.0 milestone Feb 16, 2019

mroeschke closed this as completed in #25308 Feb 28, 2019
