Fix IntervalDtype Bugs and Inconsistencies #18997

jschendel · 2017-12-29T21:09:36Z

closes IntervalDtype inconsistencies and bugs #18980
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Summary:

IntervalDtype(*) now returns True when compared against 'interval' and IntervalDtype() regardless of subtype
IntervalDtype(*) now returns 'interval' regardless of subtype
- str(IntervalDtype(*)) still displays subtype information, e.g. 'interval[int64]'
Cleaned up miscellaneous tests related to IntervalDtype

jreback · 2017-12-29T23:35:04Z

doc/source/whatsnew/v0.23.0.txt

@@ -294,6 +295,7 @@ Conversion
 - Bug in :meth:`DatetimeIndex.astype` when converting between timezone aware dtypes, and converting from timezone aware to naive (:issue:`18951`)
 - Bug in :class:`FY5253` where ``datetime`` addition and subtraction incremented incorrectly for dates on the year-end but not normalized to midnight (:issue:`18854`)
 - Bug in :class:`DatetimeIndex` where adding or subtracting an array-like of ``DateOffset`` objects either raised (``np.array``, ``pd.Index``) or broadcast incorrectly (``pd.Series``) (:issue:`18849`)
+- Bug in ``IntervalDtype`` when constructing two instances with subtype ``CategoricalDtype`` where the second instance used cached attributes from the first (:issue:`18980`)


are you mixing IntervalDtype and CategoricalDtype here?

I'm referring to the behavior where you can only store one CategoricalDtype in the subtype cache:

In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True)) In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False)) In [14]: idt2.subtype Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

Once you've created an IntervalDtype with CategoricalDtype subtype, you can't create any other unique ones. Since the caching was previously based on str(subtype), and str(CategoricalDtype(*)) always returns 'category', trying to create secondary instances, even with different categories or ordered, will always hit the cache and pull out the first one.

jreback · 2017-12-29T23:36:01Z

pandas/core/dtypes/dtypes.py

+            # GH 18980: need to combine since str and hash individually may not
+            # be unique, e.g. str(CategoricalDtype) always returns 'category',
+            # and hash(np.dtype('<m8')) == hash(np.dtype('<m8[ns]'))
+            key = ''.join([str(subtype), str(hash(subtype))])


what are you doing here?

Trying to find a way create unique keys for each subtype that doesn't result in unwanted collisions. The previous behavior was essentially key=str(subtype). That doesn't work for CategoricalDtype subtypes though, since str(CategoricalDtype(*)) always returns 'category'. You then end up with a collision if you try to create a second IntervalDtype with CategoricalDtype subtype, and get back an IntervalDtype with attributes from the first CategoricalDtype:

In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True)) In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False)) In [14]: idt2.subtype Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

My initial thought to fix this was to use hash instead of str, so essentially key = hash(subtype). That also doesn't quite work, because some numpy dtypes hash to the same value even though they're different:

In [2]: np.dtype('<m8') == np.dtype('<m8[ns]') Out[2]: False In [3]: hash(np.dtype('<m8')) == hash(np.dtype('<m8[ns]')) Out[3]: True

I originally implemented it with key = hash(subtype), and the tests failed on travis and appveyor, so it is something that comes up.

It looked like the dtypes that failed for str and hash were mutually exclusive, so making the key a combination of str and hash seemed like it would work. Certainly open to other solutions though, as it's not the cleanest method. Or should we just revert to str and explicitly not support CategoricalDtype as a subtype? Not sure how often that would actually come up.

jreback · 2017-12-29T23:37:00Z

pandas/tests/dtypes/test_dtypes.py

-        i = IntervalDtype('interval')
-        assert i.subtype == ''
+    @pytest.mark.parametrize('subtype', [
+        'interval[int64]', 'Interval[int64]', 'int64', np.dtype('int64')])


hmm, I though we only accept interval[...] and not Interval[...]?

The existing behavior allows for Interval[...]. Really just took those subtypes out of a for loop and used parametrize instead in an effort to clean up the test. The existing regex match string includes "I", and there is existing code checks against .title(), so it looks like this was intentional:

pandas/pandas/core/dtypes/dtypes.py

Line 635 in 8433562

_match = re.compile(r"(I|i)nterval\[(?P<subtype>.+)\]")

pandas/pandas/core/dtypes/dtypes.py

Lines 708 to 709 in 8433562

if isinstance(other, compat.string_types):

return other == self.name or other == self.name.title()

I think this was copied from PeriodDtype, because I see nearly identical code there. Doesn't look like it'd be too hard to remove this to enforce lowercase, if so desired.

jreback · 2017-12-29T23:37:28Z

pandas/tests/dtypes/test_dtypes.py


-        i = IntervalDtype()
+    @pytest.mark.parametrize('subtype', [None, 'interval', 'interval[]'])


interval[] prob doesn't make sense, is there a reason to keep it?

Yeah, just put that there for legacy purposes, since it was previously a valid representation:

In [4]: IntervalDtype('interval') Out[4]: interval[]

Agree that it doesn't really make sense, and can remove it.

codecov · 2017-12-30T00:09:07Z

Codecov Report

Merging #18997 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18997      +/-   ##
==========================================
+ Coverage   91.51%   91.54%   +0.03%     
==========================================
  Files         148      148              
  Lines       48780    48777       -3     
==========================================
+ Hits        44639    44652      +13     
+ Misses       4141     4125      -16

Flag	Coverage Δ
#multiple	`89.91% <100%> (+0.03%)`	⬆️
#single	`41.6% <27.27%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/dtypes/dtypes.py	`96.11% <100%> (+0.79%)`	⬆️
pandas/core/dtypes/common.py	`94.54% <0%> (+0.24%)`	⬆️
pandas/plotting/_converter.py	`66.95% <0%> (+1.73%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5194e5...bd9805e. Read the comment docs.

jreback · 2017-12-30T20:25:08Z

is there utility in having an IntervalIndex that works on objects / categorical types generally? (as opposed to numeric [float,int,datetime]).

jschendel · 2017-12-30T23:06:48Z

is there utility in having an IntervalIndex that works on objects / categorical types generally?

I've been wondering the same thing, and have been going back and forth. It seems like we'd need to put in a fair number of guardrails for dtypes that aren't continuous (or continuous enough) where things like mid and length aren't guaranteed to be defined. I've been asked in other PR's to handle such edge cases, e.g. #18805 (comment). I'm not sure I'd classify that as demand, so much as reviewers just trying to ensure that all corners are covered.

As a source of comparison, postgres range types only provide numeric support by default. However, postgres also makes it fairly easy to extend this to non-numeric types, e.g.

CREATE TYPE inetrange AS RANGE (
    SUBTYPE = inet
);

would provide support for intervals of IP addresses. I don't think there'd be such easy extensions in our case if we only provided numeric support?

In terms of use cases, everything that I'd use intervals for would be covered by numeric. I can imagine some scenarios where non-numeric could be useful though. For example, intervals with string endpoints essentially allows for prefix level searches:

In [2]: iv = pd.Interval('ab', 'bas')

In [3]: 'abracadabra' in iv
Out[3]: True

In [4]: 'bar' in iv
Out[4]: True

In [5]: 'bass' in iv
Out[5]: False

Some quick searching turns up a postgres extension that appears do this. The real world use case mentioned there is telephony applications. I could maybe see similar prefix stuff being done in a biology/bioinformatics context, but really only have passing knowledge of that area, so could be wrong.

I could see applications for ordered categoricals with categories along the lines of low/medium/high (but more categories), where you'd have logical intervals to partition by. Many problems could probably be solved via alternative approaches though, e.g. groupby logic. Not sure if there'd be cases where Interval support would be far and away a superior solution. You could probably even solve such problems without explicit Categorical support by mapping label <--> integer, creating integer intervals, and mapping back as appropriate, though kind of seems like reinventing the Categorical wheel.

All that being said, I don't know how prevalent these use cases would actually be. Could very well not be worth the effort to support such operations.

jreback · 2017-12-30T23:13:17Z

cc @zfrenchee any thoughts on the object / categorical (non-numeric) case for intervals?

alexlenail · 2017-12-30T23:57:36Z

@jreback @jschendel I can't think of a use case for categorical intervals in my work, but that doesn't mean it can't exist. The telephony / ip address idea seems like it would be reasonable.

The prefix search is interesting, but I imagine if you wanted a prefix tree, you wouldn't create a multi-intervalindex df (though it might work), but instead use a prefix-tree package.

I might be cautious with permitting IntervalIndex of arbitrary subtype. Having a complex type system is nice when it works, but I wouldn't want to have to debug issues people brought me involving multi-IntervalIndexes of exotic subtypes. Not sure that's a good reason though.

jreback · 2017-12-31T00:01:52Z

I agree. @jschendel if you wouldn't mind in this (or maybe better other) PR. to limit IntervalIndex dtypes that it will accept. (e.g. I think all except object, categorical). Then this PR becomes much simpler.

jschendel · 2018-01-07T03:51:36Z

Updated to remove code for unsupported dtypes now that #19022 is merged. Also removed parsing 'interval[]' for IntervalDtype construction.

jreback · 2018-01-07T15:16:27Z

lgtm. @TomAugspurger if you'd have a look.

* Fix IntervalDtype Bugs and Inconsistencies * remove code for unsupported dtypes and remove 'interval[]'

jreback added API Design Bug Dtype Conversions Unexpected or buggy dtype conversions Interval Interval data type labels Dec 29, 2017

jreback requested changes Dec 29, 2017

View reviewed changes

jreback mentioned this pull request Dec 31, 2017

API: prohibit non-numeric dtypes in construction of IntervalIndex #19016

Closed

jschendel force-pushed the intervaldtype-fixes branch from b2da9c7 to 82cc71d Compare January 7, 2018 03:49

jreback added this to the 0.23.0 milestone Jan 7, 2018

jreback approved these changes Jan 7, 2018

View reviewed changes

jschendel added 2 commits January 8, 2018 11:04

Fix IntervalDtype Bugs and Inconsistencies

6b00114

remove code for unsupported dtypes and remove 'interval[]'

bd9805e

jschendel force-pushed the intervaldtype-fixes branch from 82cc71d to bd9805e Compare January 8, 2018 18:06

TomAugspurger merged commit 982e112 into pandas-dev:master Jan 10, 2018

maximveksler pushed a commit to maximveksler/pandas that referenced this pull request Jan 11, 2018

Fix IntervalDtype Bugs and Inconsistencies (pandas-dev#18997)

aaaeb0c

* Fix IntervalDtype Bugs and Inconsistencies * remove code for unsupported dtypes and remove 'interval[]'

jschendel deleted the intervaldtype-fixes branch September 24, 2018 17:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix IntervalDtype Bugs and Inconsistencies #18997

Fix IntervalDtype Bugs and Inconsistencies #18997

jschendel commented Dec 29, 2017 •

edited

Loading

jreback Dec 29, 2017

jschendel Dec 30, 2017

jreback Dec 29, 2017

jschendel Dec 30, 2017

jreback Dec 29, 2017

jschendel Dec 30, 2017

jreback Dec 29, 2017

jschendel Dec 30, 2017

jschendel Jan 7, 2018

codecov bot commented Dec 30, 2017 •

edited

Loading

jreback commented Dec 30, 2017

jschendel commented Dec 30, 2017

jreback commented Dec 30, 2017

alexlenail commented Dec 30, 2017 •

edited

Loading

jreback commented Dec 31, 2017

jschendel commented Jan 7, 2018

jreback commented Jan 7, 2018

	if isinstance(other, compat.string_types):
	return other == self.name or other == self.name.title()


		i = IntervalDtype()
		@pytest.mark.parametrize('subtype', [None, 'interval', 'interval[]'])

Fix IntervalDtype Bugs and Inconsistencies #18997

Fix IntervalDtype Bugs and Inconsistencies #18997

Conversation

jschendel commented Dec 29, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Dec 30, 2017 • edited Loading

Codecov Report

jreback commented Dec 30, 2017

jschendel commented Dec 30, 2017

jreback commented Dec 30, 2017

alexlenail commented Dec 30, 2017 • edited Loading

jreback commented Dec 31, 2017

jschendel commented Jan 7, 2018

jreback commented Jan 7, 2018

jschendel commented Dec 29, 2017 •

edited

Loading

codecov bot commented Dec 30, 2017 •

edited

Loading

alexlenail commented Dec 30, 2017 •

edited

Loading