-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix IntervalDtype Bugs and Inconsistencies #18997
Fix IntervalDtype Bugs and Inconsistencies #18997
Conversation
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -294,6 +295,7 @@ Conversion | |||
- Bug in :meth:`DatetimeIndex.astype` when converting between timezone aware dtypes, and converting from timezone aware to naive (:issue:`18951`) | |||
- Bug in :class:`FY5253` where ``datetime`` addition and subtraction incremented incorrectly for dates on the year-end but not normalized to midnight (:issue:`18854`) | |||
- Bug in :class:`DatetimeIndex` where adding or subtracting an array-like of ``DateOffset`` objects either raised (``np.array``, ``pd.Index``) or broadcast incorrectly (``pd.Series``) (:issue:`18849`) | |||
- Bug in ``IntervalDtype`` when constructing two instances with subtype ``CategoricalDtype`` where the second instance used cached attributes from the first (:issue:`18980`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you mixing IntervalDtype and CategoricalDtype here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm referring to the behavior where you can only store one CategoricalDtype
in the subtype cache:
In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True))
In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False))
In [14]: idt2.subtype
Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
Once you've created an IntervalDtype
with CategoricalDtype
subtype, you can't create any other unique ones. Since the caching was previously based on str(subtype)
, and str(CategoricalDtype(*))
always returns 'category'
, trying to create secondary instances, even with different categories
or ordered
, will always hit the cache and pull out the first one.
pandas/core/dtypes/dtypes.py
Outdated
# GH 18980: need to combine since str and hash individually may not | ||
# be unique, e.g. str(CategoricalDtype) always returns 'category', | ||
# and hash(np.dtype('<m8')) == hash(np.dtype('<m8[ns]')) | ||
key = ''.join([str(subtype), str(hash(subtype))]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you doing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to find a way create unique keys for each subtype that doesn't result in unwanted collisions. The previous behavior was essentially key=str(subtype)
. That doesn't work for CategoricalDtype
subtypes though, since str(CategoricalDtype(*))
always returns 'category'
. You then end up with a collision if you try to create a second IntervalDtype
with CategoricalDtype
subtype, and get back an IntervalDtype
with attributes from the first CategoricalDtype
:
In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True))
In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False))
In [14]: idt2.subtype
Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
My initial thought to fix this was to use hash
instead of str
, so essentially key = hash(subtype)
. That also doesn't quite work, because some numpy dtypes hash to the same value even though they're different:
In [2]: np.dtype('<m8') == np.dtype('<m8[ns]')
Out[2]: False
In [3]: hash(np.dtype('<m8')) == hash(np.dtype('<m8[ns]'))
Out[3]: True
I originally implemented it with key = hash(subtype)
, and the tests failed on travis and appveyor, so it is something that comes up.
It looked like the dtypes that failed for str
and hash
were mutually exclusive, so making the key a combination of str
and hash
seemed like it would work. Certainly open to other solutions though, as it's not the cleanest method. Or should we just revert to str
and explicitly not support CategoricalDtype
as a subtype? Not sure how often that would actually come up.
pandas/tests/dtypes/test_dtypes.py
Outdated
i = IntervalDtype('interval') | ||
assert i.subtype == '' | ||
@pytest.mark.parametrize('subtype', [ | ||
'interval[int64]', 'Interval[int64]', 'int64', np.dtype('int64')]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I though we only accept interval[...]
and not Interval[...]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing behavior allows for Interval[...]
. Really just took those subtypes out of a for
loop and used parametrize
instead in an effort to clean up the test. The existing regex match string includes "I", and there is existing code checks against .title()
, so it looks like this was intentional:
pandas/pandas/core/dtypes/dtypes.py
Line 635 in 8433562
_match = re.compile(r"(I|i)nterval\[(?P<subtype>.+)\]") |
pandas/pandas/core/dtypes/dtypes.py
Lines 708 to 709 in 8433562
if isinstance(other, compat.string_types): | |
return other == self.name or other == self.name.title() |
I think this was copied from PeriodDtype
, because I see nearly identical code there. Doesn't look like it'd be too hard to remove this to enforce lowercase, if so desired.
pandas/tests/dtypes/test_dtypes.py
Outdated
|
||
i = IntervalDtype() | ||
@pytest.mark.parametrize('subtype', [None, 'interval', 'interval[]']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interval[]
prob doesn't make sense, is there a reason to keep it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, just put that there for legacy purposes, since it was previously a valid representation:
In [4]: IntervalDtype('interval')
Out[4]: interval[]
Agree that it doesn't really make sense, and can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
Codecov Report
@@ Coverage Diff @@
## master #18997 +/- ##
==========================================
+ Coverage 91.51% 91.54% +0.03%
==========================================
Files 148 148
Lines 48780 48777 -3
==========================================
+ Hits 44639 44652 +13
+ Misses 4141 4125 -16
Continue to review full report at Codecov.
|
is there utility in having an IntervalIndex that works on objects / categorical types generally? (as opposed to numeric [float,int,datetime]). |
I've been wondering the same thing, and have been going back and forth. It seems like we'd need to put in a fair number of guardrails for dtypes that aren't continuous (or continuous enough) where things like As a source of comparison, postgres CREATE TYPE inetrange AS RANGE (
SUBTYPE = inet
); would provide support for intervals of IP addresses. I don't think there'd be such easy extensions in our case if we only provided numeric support? In terms of use cases, everything that I'd use intervals for would be covered by numeric. I can imagine some scenarios where non-numeric could be useful though. For example, intervals with string endpoints essentially allows for prefix level searches: In [2]: iv = pd.Interval('ab', 'bas')
In [3]: 'abracadabra' in iv
Out[3]: True
In [4]: 'bar' in iv
Out[4]: True
In [5]: 'bass' in iv
Out[5]: False Some quick searching turns up a postgres extension that appears do this. The real world use case mentioned there is telephony applications. I could maybe see similar prefix stuff being done in a biology/bioinformatics context, but really only have passing knowledge of that area, so could be wrong. I could see applications for ordered categoricals with categories along the lines of low/medium/high (but more categories), where you'd have logical intervals to partition by. Many problems could probably be solved via alternative approaches though, e.g. All that being said, I don't know how prevalent these use cases would actually be. Could very well not be worth the effort to support such operations. |
cc @zfrenchee any thoughts on the object / categorical (non-numeric) case for intervals? |
@jreback @jschendel I can't think of a use case for categorical intervals in my work, but that doesn't mean it can't exist. The telephony / ip address idea seems like it would be reasonable. The prefix search is interesting, but I imagine if you wanted a prefix tree, you wouldn't create a multi-intervalindex df (though it might work), but instead use a prefix-tree package. I might be cautious with permitting IntervalIndex of arbitrary subtype. Having a complex type system is nice when it works, but I wouldn't want to have to debug issues people brought me involving multi-IntervalIndexes of exotic subtypes. Not sure that's a good reason though. |
I agree. @jschendel if you wouldn't mind in this (or maybe better other) PR. to limit IntervalIndex dtypes that it will accept. (e.g. I think all except object, categorical). Then this PR becomes much simpler. |
b2da9c7
to
82cc71d
Compare
Updated to remove code for unsupported dtypes now that #19022 is merged. Also removed parsing |
lgtm. @TomAugspurger if you'd have a look. |
82cc71d
to
bd9805e
Compare
* Fix IntervalDtype Bugs and Inconsistencies * remove code for unsupported dtypes and remove 'interval[]'
git diff upstream/master -u -- "*.py" | flake8 --diff
Summary:
IntervalDtype(*)
now returnsTrue
when compared against'interval'
andIntervalDtype()
regardless of subtypeIntervalDtype(*)
now returns'interval'
regardless of subtypestr(IntervalDtype(*))
still displays subtype information, e.g.'interval[int64]'
IntervalDtype