Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix some PeriodIndex resampling issues #16153

Merged

Conversation

winklerand
Copy link
Contributor

@winklerand winklerand commented Apr 26, 2017

closes #15944
xref partially #12884
closes #13083
closes #13224

This PR addresses some of the issues related to PeriodIndex resampling.
As I'm new to the pandas codebase, I appreciate any advice.

@winklerand winklerand changed the title BUG: Fix some period index resampling issues BUG: Fix some PeriodIndex resampling issues Apr 26, 2017

with pytest.raises(NotImplementedError):
df.resample('2D', level='d')
for freq in ['H', '12H', '2D', 'W']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests are more clear with parametrize
but we still inherit from TestCase so can't change this

but you can start a new test class that does this

again if it's more clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I understood correctly - pushed 48e324f as an example, are you thinking along those lines of using pytest fixtures/mark.parametrize decorators?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented directly on the commit

@jreback jreback added Period Period data type Resample resample method labels Apr 27, 2017
@jreback
Copy link
Contributor

jreback commented Apr 27, 2017

lmk when you need review.

@codecov
Copy link

codecov bot commented Apr 27, 2017

Codecov Report

Merging #16153 into master will decrease coverage by 0.04%.
The diff coverage is 97.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16153      +/-   ##
==========================================
- Coverage   91.27%   91.23%   -0.05%     
==========================================
  Files         163      163              
  Lines       49765    49774       +9     
==========================================
- Hits        45421    45409      -12     
- Misses       4344     4365      +21
Flag Coverage Δ
#multiple 89.02% <97.61%> (-0.03%) ⬇️
#single 40.33% <7.14%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/resample.py 96.05% <97.61%> (-0.12%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 63.38% <0%> (-1.82%) ⬇️
pandas/compat/numpy/function.py 93.12% <0%> (-0.21%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/core/sparse/array.py 91.58% <0%> (-0.01%) ⬇️
pandas/core/indexes/interval.py 92.85% <0%> (ø) ⬆️
pandas/core/tools/datetimes.py 83.79% <0%> (+0.82%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad7d051...6084e0c. Read the comment docs.

@winklerand
Copy link
Contributor Author

winklerand commented Apr 29, 2017

ok, gave parametrized tests a shot in 68f02f7.
In my opinion, those are much more readable - it's clear at first sight which combinations of parameters/inputs are tested.

If you agree, I'd be happy to help moving/refactoring more of the tests in TestPeriodIndex. I'd like to tackle some more issues with PeriodIndex in a follow-up PR (convention parameter, maybe drop kind parameter altogether, ...) which would require better coverage of all parameter combinations/edge cases.

return DataFrame({'value': np.arange(len(index))}, index=index)

@pytest.fixture(params=[Series, DataFrame], scope='class')
def pandas_obj(self, request, index):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is probably a more elegant way to create such "meta-fixtures"? ;-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do it more like this

@pytest.fixture(params=[PI, DTI, TDI])
def index(klass):
     # construct an index
     return klass(.....)

@pytest.fixture
def series(index):
    return Series(.....)

@pytest.fixture
def frame(index):
     return DataFrame(....)

it will cartesian product the indexes separately for series/frame which would now be the fixture arg to a test function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried another approach by keeping the existing inheritance structure - we can inherit fixtures defined in the Base class, index fixture defined in Base dispatches to the subclass-specific type via the existing _index_factory. If you want to take a look:
https://github.com/winklerand/pandas/blob/bbf01b312a2d8be0b3c7681b9a46a039f33fdcc/pandas/tests/test_resample_pytest.py

Seems easier to me than parametrizing the index fixture, there's a lot of index type-specific test logic.

Should we move out this discussion to another issue or PR? This looks like a more general issue on how to (gradually) transition to pytest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO

we are removing the base classes soon

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry u meant the Base class and not TestCase base class

in any event it's not more clear to me

Copy link
Contributor Author

@winklerand winklerand Apr 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, was maybe a bit confusing ;-)

pushed e497fc2, should illustrate what I meant: basically move fixtures to a Base_pytest(object) class which can be inherited by TestPeriodIndex_pytest, and DatetimeIndex_pytest. This resembles the existing test organization structure, but allows for parametrization using the common fixtures.


if self.loffset is not None:
if self.kind == 'period':
print('Warning: loffset -> convert PeriodIndex to timestamps')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use warnings.warn rather than print

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this here before? we don't warn on things like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a similar print-style warning is currently in master: https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L800:L802

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defer to @jreback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the warning in 8a4182e - any objections?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. I have to do another look on the actual tests.

elif is_superperiod(ax.freq, self.freq):
if how == 'ohlc':
# upsampling to subperiods is handled as an asfreq, which works
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue reference here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

'an instance of %r' % type(ax).__name__)

memb = ax.asfreq(self.freq, how=self.convention)
# NaT handling as in pandas._lib.lib.generate_bins_dt64()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

nat_count = 0
if memb.hasnans:
import warnings
with warnings.catch_warnings():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this

memb._isnan already has this mask.


i8 = memb.asi8
freq_mult = self.freq.n
# when upsampling to subperiods, we need to generate enough bins
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

freq_mult = self.freq.n
# when upsampling to subperiods, we need to generate enough bins
expected_bins_count = len(binner) * freq_mult
i8_extend = expected_bins_count - (i8[-1] - i8[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will fail if not len(memb)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some lines above (after masking out NaTs), we check for if not len(memb) and return and empty index, so this case should be caught.
Maybe put in an assert len(memb) to make it explicit?

freq=self._index_fixture_freq)

@pytest.fixture(params=[Series, DataFrame], scope='class')
def pandas_obj(self, request, index):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this 2 fixtures, series and frame, much simpler, and easier to grok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can then make another fixture like series_and_frame if you want, though I find that more confusing, rather have you use params.

@@ -2777,6 +2694,211 @@ def test_evenly_divisible_with_no_extra_bins(self):
assert_frame_equal(result, expected)


def assert_series_or_frame_equal(result, expected):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is prob ok, but move to util.testing

pass


class TestPeriodIndex_pytest(Base_pytest):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather name these TestPeriodIndex(Base) and rename the original tests I think (and then we should move them in other PR's)

@winklerand winklerand force-pushed the fix-some-PeriodIndex-resampling-issues branch from 8a4182e to 09d7ac3 Compare May 6, 2017 07:31
@winklerand
Copy link
Contributor Author

winklerand commented May 6, 2017

Rebased on current master, so we can put the parametrized test in the existing test classes (which just inherit from object).
Thanks for feedback, all requested changed should be completed.

@jreback
Copy link
Contributor

jreback commented May 6, 2017

cc @MaximilianR can you have a look.

@max-sixty
Copy link
Contributor

Looks great. Fixes a couple of important issues.

Whether or not we reveal kind on resamples is an open question - I think there already enough tools for switching between DateTimeIndex and PeriodIndex that we don't need more in resample.

But this definitely makes things better! Thanks @winklerand

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really just some minor stylistic changes.

can you fully close all of the mentioned issues at the top of the PR, and then create a new issue with any remnants. (ideally with check boxes).
Also pls create a new issue for test_resample reorg as I suggested. thanks.

bins = memb.searchsorted(rng, side='left')

if nat_count > 0:
# NaT handling as in pandas._lib.lib.generate_bins_dt64()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this path tested sufficiently, e.g. 0, 1, 2 NaT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a test case for consecutive NaTs in the index (1cad7fa)

Should be sufficiently tested, cases covered:

  • 0 NaT: basically all other resampling tests
  • multiple single NaTs (at beginning, inside and end of index)
  • consecutive NaTs (at beginning, inside and end of index)

Any ideas for more exhaustive test cases?

@@ -696,6 +696,32 @@ def create_index(self, *args, **kwargs):
factory = self._index_factory()
return factory(*args, **kwargs)

_index_fixture_start = datetime(2005, 1, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a followup PR. test_resample is getting sufficiently long that it makes sense to split this into some sub-dirs e.g.

pandas/tests/resample/test_api, test_grouping, test_datetimeindex, test_period etc. IOW split out the classes to separate files (to start).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would normally co-locate fixtures into a common.py

_index_fixture_end = datetime(2005, 1, 10)
_index_fixture_freq = 'D'

@pytest.fixture(scope='class')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normally don't use scoping like this (and instead use default function scoping). The issue is that we could have mutating things which will then propogate (we shouldn't in any event).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, changed to default function scoping (6fe6c0c)

@@ -696,6 +696,32 @@ def create_index(self, *args, **kwargs):
factory = self._index_factory()
return factory(*args, **kwargs)

_index_fixture_start = datetime(2005, 1, 1)
_index_fixture_end = datetime(2005, 1, 10)
_index_fixture_freq = 'D'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these can be fixtures themselves

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@pytest.mark.parametrize('freq', ['2D'])
@pytest.mark.parametrize('kind', ['period', None, 'timestamp'])
def test_asfreq_downsample(self, series_and_frame, freq, kind):
# GH 12884, 15944
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give a 1-liner about the issue that you are specifiying test when not obvious (e.g. here)

start = obj.index[0].to_timestamp(how='start')
end = (obj.index[-1] + 1).to_timestamp(how='start')
if kind == 'timestamp':
expected = obj.to_timestamp().resample(freq).asfreq()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this test be combined with prior test? (looks like you can just add to the freq), or is there a reason not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, parametrization makes this obvious. In fact, both tests cover almost the same code path, so I consolidated them into one (f8f4157).
Reason basically was not to change the existing structure - the same methods are defined in the Base class. I consolidated them as well in the Base class - PTAL.

expected_index = self.create_index(df.index[0],
periods=len(df.index) / 2,
freq='2D')
# loffset coreces PeriodIndex to DateTimeIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

result_how = df.resample('2D', how=agg_arg, loffset='2H',
kind=kind)
if isinstance(agg_arg, list):
expected.columns = pd.MultiIndex .from_tuples([('value', 'mean')])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have an odd space after pd.MultiIndex

pi = PeriodIndex(start='2000', freq='D', periods=10)
s = Series(range(len(pi)), index=pi)
expected = s.to_timestamp().resample(freq).ohlc().to_period(freq)
# timestamp-based resampling doesn't include all sub-periods
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

assert_frame_equal(result, expected)

def test_resample_with_only_nat(self):
pi = PeriodIndex([pd.NaT] * 3, freq='S')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue number for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done (not mentioned in the issue explicitly, is just an edge case)

@jreback
Copy link
Contributor

jreback commented May 7, 2017

Whether or not we reveal kind on resamples is an open question - I think there already enough tools for switching between DateTimeIndex and PeriodIndex that we don't need more in resample.

For now this is fine. Yes I think we should remove this as PI becomes more flexible. But that's for another time.

@jreback jreback added this to the 0.21.0 milestone May 7, 2017
@jreback
Copy link
Contributor

jreback commented May 7, 2017

@winklerand biggest thing is going to need a small section in 0.21.0 to show these changes. In fact I find that writing these examples forces organization on describing the actual changes. Then (in this issue or followup), ideally like to example the timeseries.rst docs w.r.t. PI resampling.

expected = s.reindex(new_index)
assert_series_equal(result, expected)
@pytest.mark.parametrize('freq', ['2D', '1H'])
def test_asfreq(self, series_and_frame, freq):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
I consolidated the test_asfreq_downsample() and test_asfreq_upsample() tests in the Base class as well (so we don't have to override them in TestPeriodIndex). PTAL if everything's still correct and covered as before.

@winklerand
Copy link
Contributor Author

@jreback sorry for the delay on this one.

I think I made the requested changes, PTAL. Thanks for the great suggestions.

I put a small PeriodIndex resampling paragraph into the v0.21 whatsnew documentation, in the 'Backward incompatible API changes' section - is this the right place?

Going to organise the issues (closing the solved ones, open new one with the remaining issues) and create a test_resample reorg issue later tonight.

@gfyoung
Copy link
Member

gfyoung commented Jul 13, 2017

@jreback : Were we waiting for anything on this one?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. couple of comments.


.. code-block:: ipython

In [1]: pi = pd.period_range('2017-01', periods=12, freq='M')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fixed things that are not changed (e.g. [1], [2]), do in a separate ipython block above (to avoid repeating in previous/new).

``PeriodIndex`` resampling
^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions of pandas, resampling a ``Series``/``DataFrame`` indexed by a ``PeriodIndex`` returned a ``DatetimeIndex`` in some cases (:issue:`12884`). Resampling to a multiplied frequency now returns a ``PeriodIndex`` (:issue:`15944`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the only issue?


if self.loffset is not None:
# Cannot apply loffset/timedelta to PeriodIndex -> convert to
# timestamps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we show a warning for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should, given that we kind of did that before (it was print statement with a warning!?)

@jreback
Copy link
Contributor

jreback commented Jul 13, 2017

side issue, maybe we should split up test_resample.py similar to groupby, e.g. make a sub-dir with common, test_datetime, test_period, test_timedelta. to make this a bit more understandable (let's make a separate issue for that).

@jreback
Copy link
Contributor

jreback commented Aug 18, 2017

can you rebase. this looked really good. need to review once more.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

@winklerand sorry let this linger. can you rebase.

@jreback jreback force-pushed the fix-some-PeriodIndex-resampling-issues branch from 48f1e76 to 0f63098 Compare September 28, 2017 14:29
@jreback
Copy link
Contributor

jreback commented Sep 28, 2017

ok I rebased, will look after passing

@jreback
Copy link
Contributor

jreback commented Sep 28, 2017

@MaximilianR can you have a look. in particular can you look at the original issues and see if we are fully / partially covering and make a note.


# if index contains no valid (non-NaT) values, return empty index
if not len(memb):
binner = labels = PeriodIndex(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use _shallow_copy here, but this is OK

Copy link
Contributor

@jreback jreback Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this, ok for now.

winklerand and others added 18 commits September 29, 2017 08:15
Added tests to cover all code paths, moved the check to _convert_obj() and
removed then-redundant check from _upsample() method.
Exceptions:
- force conversion to DatetimeIndex by kind='timestamp' param
- if loffset is given, convert to timestamps in any case
Removed previously defined helper method assert_series_or_frame_equal(),
which behaved identical to assert_almost_equal()
@jreback jreback force-pushed the fix-some-PeriodIndex-resampling-issues branch from 0f63098 to 6084e0c Compare September 29, 2017 12:24
@jreback jreback merged commit 7d4a260 into pandas-dev:master Oct 1, 2017
@jreback
Copy link
Contributor

jreback commented Oct 1, 2017

thanks @winklerand nice fixes!

ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 2, 2017
* 'master' of github.com:pandas-dev/pandas: (188 commits)
  Separate out _convert_datetime_to_tsobject (pandas-dev#17715)
  DOC: remove whatsnew note for xref pandas-dev#17131
  BUG: Regression in .loc accepting a boolean Index as an indexer (pandas-dev#17738)
  DEPR: Deprecate cdate_range and merge into bdate_range (pandas-dev#17691)
  CLN: replace %s syntax with .format in pandas.core: categorical, common, config, config_init (pandas-dev#17735)
  Fixed the memory usage explanation of categorical in gotchas from O(nm) to O(n+m) (pandas-dev#17736)
  TST: add backward compat for offset testing for pickles (pandas-dev#17733)
  remove unused time conversion funcs (pandas-dev#17711)
  DEPR: Deprecate convert parameter in take (pandas-dev#17352)
  BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#17587)
  BUG: Fix some PeriodIndex resampling issues (pandas-dev#16153)
  BUG: Fix unexpected sort in groupby (pandas-dev#17621)
  DOC: Fixed typo in documentation for 'pandas.DataFrame.replace' (pandas-dev#17731)
  BUG: Fix series rename called with str altering name rather index (GH17407) (pandas-dev#17654)
  DOC: Add examples for MultiIndex.get_locs + cleanups (pandas-dev#17675)
  Doc improvements for IntervalIndex and Interval (pandas-dev#17714)
  BUG: DataFrame sort_values and multiple "by" columns fails to order NaT correctly
  Last of the timezones funcs (pandas-dev#17669)
  Add missing file to _pyxfiles, delete commented-out (pandas-dev#17712)
  update imports of DateParseError, remove unused imports from tslib (pandas-dev#17713)
  ...
@winklerand
Copy link
Contributor Author

thanks @jreback for pushing this one over the finish line, completely forgot about it (works for me syndrome ;-)), sorry!

@winklerand winklerand deleted the fix-some-PeriodIndex-resampling-issues branch October 6, 2017 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Period Period data type Resample resample method
Projects
None yet
4 participants