Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: to_period behaves different between Timestamp and DatetimeIndex with timezones #22905

Closed
miccoli opened this issue Sep 30, 2018 · 7 comments · Fixed by #24634
Closed

BUG: to_period behaves different between Timestamp and DatetimeIndex with timezones #22905

miccoli opened this issue Sep 30, 2018 · 7 comments · Fixed by #24634
Labels
Datetime Datetime data dtype good first issue Needs Tests Unit test(s) needed to prevent regressions Period Period data type Timezones Timezone data dtype
Milestone

Comments

@miccoli
Copy link
Contributor

miccoli commented Sep 30, 2018

After 935c5d1 pandas/tests/indexes/datetimes/test_astype.py fails depending on the timezone:
in fact

TZ=Etc/GMT+1 pytest pandas/tests/indexes/datetimes/test_astype.py -k TestToPeriod

is OK while

TZ=Etc/GMT-1 pytest pandas/tests/indexes/datetimes/test_astype.py -k TestToPeriod

fails.

This can be nailed down to the following code sample.

Code Sample

>>> ts = pd.date_range('1/1/2000', '2/1/2000', tz='Etc/GMT-1')
>>> ts.to_period()[0] == ts[0].to_period()
False
>>> ts.to_period()[0], ts[0].to_period()
(Period('1999-12-31', 'D'), Period('2000-01-01', 'D'))
>>> ts = pd.date_range('1/1/2000', '2/1/2000', tz='Etc/GMT+1')
>>> ts.to_period()[0] == ts[0].to_period()
True
>>> ts.to_period()[0], ts[0].to_period()
(Period('2000-01-01', 'D'), Period('2000-01-01', 'D'))

Problem description

It seems that ts.to_period()[0] == ts[0].to_period() depends on tzinfo, true if west of GMT, false if east of GMT. Before 935c5d1 ts.to_period()[0] == ts[0].to_period() is always True irrespective of tzinfo.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 935c5d1453938a284805cb7d3c0c3255390997a7
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+555.g935c5d145
pytest: 3.7.4
pip: 18.0
setuptools: 39.0.1
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: 0.10.9
IPython: 6.5.0
sphinx: 1.7.9
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.5.1
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.1.2
@tm9k1
Copy link
Contributor

tm9k1 commented Sep 30, 2018

looking into this

@tm9k1
Copy link
Contributor

tm9k1 commented Sep 30, 2018

@miccoli , please read more about pandas.date_range() here
As you can see,
since you did not specify any 'clock time' , the default value will be taken as 00:00.
From documentation ,

"tz : str or tzinfo, optional
Time zone name for returning localized DatetimeIndex, ..."

putting tz=GMT-1 will cause backwards shift by 1 hour.
To understand more clearly, see this :-
>>> ts = pd.date_range('1/1/2000 1:00', '2/1/2000 1:00', tz='Etc/GMT-1')
^ Added "clock time" as 1:00 AM, since this will be used to compensate for GMT-1, because outputs will show LOCALIZED time
>>> ts.to_period()[0] == ts[0].to_period()
True
>>> ts.to_period()[0], ts[0].to_period()
(Period('2000-01-01', 'D'), Period('2000-01-01', 'D'))

@miccoli
Copy link
Contributor Author

miccoli commented Sep 30, 2018

@brute4s99

@pytest.mark.parametrize('tz', [
'US/Eastern', pytz.utc, tzlocal(), 'dateutil/US/Eastern',
dateutil.tz.tzutc()])
def test_to_period_tz(self, tz):
ts = date_range('1/1/2000', '2/1/2000', tz=tz)
with tm.assert_produces_warning(UserWarning):
# GH#21333 warning that timezone info will be lost
result = ts.to_period()[0]
expected = ts[0].to_period()
assert result == expected
expected = date_range('1/1/2000', '2/1/2000').to_period()
with tm.assert_produces_warning(UserWarning):
# GH#21333 warning that timezone info will be lost
result = ts.to_period()
tm.assert_index_equal(result, expected)

I just reproduced the situation in the test above, which fails east of greenwich.

Let me explain this better:

>>> ts = pd.date_range('1/1/2000', '2/1/2000', tz='Etc/GMT-1')
>>>
>>> ts[0]
Timestamp('2000-01-01 00:00:00+0100', tz='Etc/GMT-1', freq='D')
>>> ts[0].to_period()
Period('2000-01-01', 'D')
>>> 
>>> ts[:1]
DatetimeIndex(['2000-01-01 00:00:00+01:00'], dtype='datetime64[ns, Etc/GMT-1]', freq='D')
>>> ts[:1].to_period()
PeriodIndex(['1999-12-31'], dtype='period[D]', freq='D')
  • ts[0] is a Timestamp and its to_period method drops tzinfo without taking into account its value.

  • ts[:1] is a length 1 DatetimeIndex and its to_period method first converts to UTC and then drops tzinfo.

I'm not sure if it's a bug or a feature, but I feel more natural to assume that ts.to_period()[0] == ts[0].to_period() is an invariant that should hold regardless of tzinfo. A different question is to decide which one, Timestamp.to_period or DatetimeIndex.to_period is correct.

Finally if positively Timestamp.to_period and DatetimeIndex.to_period should have different semantics, then the tests are wrong.

@mroeschke
Copy link
Member

Nice catch @miccoli! Agreed to_period should behave similarly between Timestamp and DatetimeIndex.

I don't work with Periods much so I welcome others' opinions, but it would think the DatetimeIndex.to_period behavior is incorrect; I don't think I would want the 1999 result.

The UTC conversion happens here in self.values:

return PeriodIndex(self.values, name=self.name, freq=freq)

@mroeschke mroeschke added Datetime Datetime data dtype Timezones Timezone data dtype Period Period data type labels Sep 30, 2018
@mroeschke mroeschke changed the title pandas/tests/indexes/datetimes/test_astype.py fails depending on timezone BUG: to_period behaves different between Timestamp and DatetimeIndex with timezones Sep 30, 2018
@tm9k1
Copy link
Contributor

tm9k1 commented Oct 1, 2018

Can I work this bug, please ?

@mroeschke
Copy link
Member

Go for it @brute4s99. Open up the pull request when you are ready for a review.

@mroeschke
Copy link
Member

This looks to be fixed on master now. Could use a test.

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jan 4, 2019
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Jan 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype good first issue Needs Tests Unit test(s) needed to prevent regressions Period Period data type Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants