Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IntervalDtype Bugs and Inconsistencies #18997

Merged
merged 2 commits into from
Jan 10, 2018

Conversation

jschendel
Copy link
Member

@jschendel jschendel commented Dec 29, 2017

Summary:

  • IntervalDtype(*) now returns True when compared against 'interval' and IntervalDtype() regardless of subtype
  • IntervalDtype(*) now returns 'interval' regardless of subtype
    • str(IntervalDtype(*)) still displays subtype information, e.g. 'interval[int64]'
  • Cleaned up miscellaneous tests related to IntervalDtype

@jreback jreback added API Design Bug Dtype Conversions Unexpected or buggy dtype conversions Interval Interval data type labels Dec 29, 2017
@@ -294,6 +295,7 @@ Conversion
- Bug in :meth:`DatetimeIndex.astype` when converting between timezone aware dtypes, and converting from timezone aware to naive (:issue:`18951`)
- Bug in :class:`FY5253` where ``datetime`` addition and subtraction incremented incorrectly for dates on the year-end but not normalized to midnight (:issue:`18854`)
- Bug in :class:`DatetimeIndex` where adding or subtracting an array-like of ``DateOffset`` objects either raised (``np.array``, ``pd.Index``) or broadcast incorrectly (``pd.Series``) (:issue:`18849`)
- Bug in ``IntervalDtype`` when constructing two instances with subtype ``CategoricalDtype`` where the second instance used cached attributes from the first (:issue:`18980`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you mixing IntervalDtype and CategoricalDtype here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to the behavior where you can only store one CategoricalDtype in the subtype cache:

In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True))

In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False))

In [14]: idt2.subtype
Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

Once you've created an IntervalDtype with CategoricalDtype subtype, you can't create any other unique ones. Since the caching was previously based on str(subtype), and str(CategoricalDtype(*)) always returns 'category', trying to create secondary instances, even with different categories or ordered, will always hit the cache and pull out the first one.

# GH 18980: need to combine since str and hash individually may not
# be unique, e.g. str(CategoricalDtype) always returns 'category',
# and hash(np.dtype('<m8')) == hash(np.dtype('<m8[ns]'))
key = ''.join([str(subtype), str(hash(subtype))])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you doing here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to find a way create unique keys for each subtype that doesn't result in unwanted collisions. The previous behavior was essentially key=str(subtype). That doesn't work for CategoricalDtype subtypes though, since str(CategoricalDtype(*)) always returns 'category'. You then end up with a collision if you try to create a second IntervalDtype with CategoricalDtype subtype, and get back an IntervalDtype with attributes from the first CategoricalDtype:

In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True))

In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False))

In [14]: idt2.subtype
Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

My initial thought to fix this was to use hash instead of str, so essentially key = hash(subtype). That also doesn't quite work, because some numpy dtypes hash to the same value even though they're different:

In [2]: np.dtype('<m8') == np.dtype('<m8[ns]')
Out[2]: False

In [3]: hash(np.dtype('<m8')) == hash(np.dtype('<m8[ns]'))
Out[3]: True

I originally implemented it with key = hash(subtype), and the tests failed on travis and appveyor, so it is something that comes up.

It looked like the dtypes that failed for str and hash were mutually exclusive, so making the key a combination of str and hash seemed like it would work. Certainly open to other solutions though, as it's not the cleanest method. Or should we just revert to str and explicitly not support CategoricalDtype as a subtype? Not sure how often that would actually come up.

i = IntervalDtype('interval')
assert i.subtype == ''
@pytest.mark.parametrize('subtype', [
'interval[int64]', 'Interval[int64]', 'int64', np.dtype('int64')])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I though we only accept interval[...] and not Interval[...]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing behavior allows for Interval[...]. Really just took those subtypes out of a for loop and used parametrize instead in an effort to clean up the test. The existing regex match string includes "I", and there is existing code checks against .title(), so it looks like this was intentional:

_match = re.compile(r"(I|i)nterval\[(?P<subtype>.+)\]")

if isinstance(other, compat.string_types):
return other == self.name or other == self.name.title()

I think this was copied from PeriodDtype, because I see nearly identical code there. Doesn't look like it'd be too hard to remove this to enforce lowercase, if so desired.


i = IntervalDtype()
@pytest.mark.parametrize('subtype', [None, 'interval', 'interval[]'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interval[] prob doesn't make sense, is there a reason to keep it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just put that there for legacy purposes, since it was previously a valid representation:

In [4]: IntervalDtype('interval')
Out[4]: interval[]

Agree that it doesn't really make sense, and can remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@codecov
Copy link

codecov bot commented Dec 30, 2017

Codecov Report

Merging #18997 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18997      +/-   ##
==========================================
+ Coverage   91.51%   91.54%   +0.03%     
==========================================
  Files         148      148              
  Lines       48780    48777       -3     
==========================================
+ Hits        44639    44652      +13     
+ Misses       4141     4125      -16
Flag Coverage Δ
#multiple 89.91% <100%> (+0.03%) ⬆️
#single 41.6% <27.27%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/dtypes.py 96.11% <100%> (+0.79%) ⬆️
pandas/core/dtypes/common.py 94.54% <0%> (+0.24%) ⬆️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5194e5...bd9805e. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

is there utility in having an IntervalIndex that works on objects / categorical types generally? (as opposed to numeric [float,int,datetime]).

@jschendel
Copy link
Member Author

is there utility in having an IntervalIndex that works on objects / categorical types generally?

I've been wondering the same thing, and have been going back and forth. It seems like we'd need to put in a fair number of guardrails for dtypes that aren't continuous (or continuous enough) where things like mid and length aren't guaranteed to be defined. I've been asked in other PR's to handle such edge cases, e.g. #18805 (comment). I'm not sure I'd classify that as demand, so much as reviewers just trying to ensure that all corners are covered.

As a source of comparison, postgres range types only provide numeric support by default. However, postgres also makes it fairly easy to extend this to non-numeric types, e.g.

CREATE TYPE inetrange AS RANGE (
    SUBTYPE = inet
);

would provide support for intervals of IP addresses. I don't think there'd be such easy extensions in our case if we only provided numeric support?

In terms of use cases, everything that I'd use intervals for would be covered by numeric. I can imagine some scenarios where non-numeric could be useful though. For example, intervals with string endpoints essentially allows for prefix level searches:

In [2]: iv = pd.Interval('ab', 'bas')

In [3]: 'abracadabra' in iv
Out[3]: True

In [4]: 'bar' in iv
Out[4]: True

In [5]: 'bass' in iv
Out[5]: False

Some quick searching turns up a postgres extension that appears do this. The real world use case mentioned there is telephony applications. I could maybe see similar prefix stuff being done in a biology/bioinformatics context, but really only have passing knowledge of that area, so could be wrong.

I could see applications for ordered categoricals with categories along the lines of low/medium/high (but more categories), where you'd have logical intervals to partition by. Many problems could probably be solved via alternative approaches though, e.g. groupby logic. Not sure if there'd be cases where Interval support would be far and away a superior solution. You could probably even solve such problems without explicit Categorical support by mapping label <--> integer, creating integer intervals, and mapping back as appropriate, though kind of seems like reinventing the Categorical wheel.

All that being said, I don't know how prevalent these use cases would actually be. Could very well not be worth the effort to support such operations.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

cc @zfrenchee any thoughts on the object / categorical (non-numeric) case for intervals?

@alexlenail
Copy link
Contributor

alexlenail commented Dec 30, 2017

@jreback @jschendel I can't think of a use case for categorical intervals in my work, but that doesn't mean it can't exist. The telephony / ip address idea seems like it would be reasonable.

The prefix search is interesting, but I imagine if you wanted a prefix tree, you wouldn't create a multi-intervalindex df (though it might work), but instead use a prefix-tree package.

I might be cautious with permitting IntervalIndex of arbitrary subtype. Having a complex type system is nice when it works, but I wouldn't want to have to debug issues people brought me involving multi-IntervalIndexes of exotic subtypes. Not sure that's a good reason though.

@jreback
Copy link
Contributor

jreback commented Dec 31, 2017

I agree. @jschendel if you wouldn't mind in this (or maybe better other) PR. to limit IntervalIndex dtypes that it will accept. (e.g. I think all except object, categorical). Then this PR becomes much simpler.

@jschendel
Copy link
Member Author

Updated to remove code for unsupported dtypes now that #19022 is merged. Also removed parsing 'interval[]' for IntervalDtype construction.

@jreback jreback added this to the 0.23.0 milestone Jan 7, 2018
@jreback
Copy link
Contributor

jreback commented Jan 7, 2018

lgtm. @TomAugspurger if you'd have a look.

@TomAugspurger TomAugspurger merged commit 982e112 into pandas-dev:master Jan 10, 2018
maximveksler pushed a commit to maximveksler/pandas that referenced this pull request Jan 11, 2018
* Fix IntervalDtype Bugs and Inconsistencies

* remove code for unsupported dtypes and remove 'interval[]'
@jschendel jschendel deleted the intervaldtype-fixes branch September 24, 2018 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Bug Dtype Conversions Unexpected or buggy dtype conversions Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IntervalDtype inconsistencies and bugs
4 participants