-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lost timezone after groupby transform #24198
Comments
This is working correctly on my branch at #24024, but not on master. In [1]: paste
import pandas as pd
df = pd.DataFrame({'end_time': [pd.to_datetime('now', utc=True).tz_convert('Asia/Singapore')],
'id': [1]})
df['max_end_time'] = df.groupby('id').end_time.transform(max)
## -- End pasted text --
In [2]: df.dtypes
Out[2]:
end_time datetime64[ns, Asia/Singapore]
id int64
max_end_time datetime64[ns, Asia/Singapore]
dtype: object
In [3]: df
Out[3]:
end_time id max_end_time
0 2018-12-10 23:26:55.386177+08:00 1 2018-12-10 15:26:55.386177+08:00 I'll add this as a test case. |
Ahh... but it isn't correct over there either. The value is wrong. |
Yea I'm not terribly surprised by this as a lot of the cythonized groupby operations use the float value of timestamps for comparisons and may never cast back. Can such ops be generalized when working with tz-aware data? Is it ever ambiguous if crossing time zones to do such a thing? If it makes sense than wondering if this is something than an EA needs to handle appropriately |
Potential duplicate or at least related to #21603 |
@WillAyd Since it appears that the groupby routines convert to UTC first before running the aggregates, should be as straightforward to ensure that tz-aware columns undergo a |
Makes sense. Do we want the groupby ops to handle this special casing though or is this better fit in the EA implementation? The former would probably be less work atm and provide better performance, though the latter would probably fit better into our programming paradigm go forward |
I don't have a clear picture how generically the groupby ops will interface with the EA in this case. I don't think we'd want to do timezone conversion within EA internals (timezone localization would be okay though) i.e. I think the most important piece is that it works generically across all groupby ops and methods. |
The more I think about this I'm not sure it can actually be supported generically, at least not for operations that reduce. If we had two timestamps with the same UTC time but in different timezones which timezone would max return?
This would assume that the input always has the same timezone, no? |
Timestamps with different timezones would exist in an object dtype column, but yeah I am not sure how the max dispatch with object dtype would handle that scenario since they would be equivalent. But for more common ops, the timestamps should have the same timezone. I was making a naive assumption that |
Code Sample, a copy-pastable example if possible
df.info()
showsdf.to_dict()
showsProblem description
The timezone is dropped silently and timestamp converted to UTC after groupby - transform operation on tz aware datetime column
Expected Output
assert df['end_time'] == df['max_end_time']
Output of
pd.show_versions()
pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: