API: ensure IntervalIndex.left/right are 64bit if numeric, part II #50195

Merged
merged 7 commits into from
Jan 10, 2023

Conversation

topper-123
Contributor

@topper-123 topper-123 commented Dec 12, 2022

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Follow-up to #50130. It turned out that IntervalArray.from_array could get around the 64bit requirement, so we fix that by moving maybe_convert_numeric_to_64bit and using it in the IntervalArray constructor as well.

Also return the 64bit index in IntervalIndex._maybe_convert_i8. Previously we returned original, which was the one not converted to 64bit (this changes a test, but it's just for an internal method).

@@ -284,7 +306,10 @@ def _simple_new(
from pandas.core.indexes.base import ensure_index

left = ensure_index(left, copy=copy)
Member

Is it possible to handle this somehow in the creation of the IntervalIndex?

Contributor Author

@topper-123 topper-123 Dec 16, 2022

I tried to remove ensure_index in various ways, but failed. It looks like there are some dtype issues that need to be passed through an Index to be solved, but I didn't manage to untangle it, unfortunately.

There is a lot going on in IntervalArray._simple_new. I've looked into moving all the validation/dtype wrangling there into a separate function. That would make _simple_new much simpler, and it would be much simpler to instantiate an IntervalArray when we can be sure the input data is correct. I'll push a PR about this shortly.

Contributor Author

I did change this in the newest version, could you take a look?

@MarcoGorelli MarcoGorelli self-requested a review December 16, 2022 10:09
Comment on lines 440 to 442
-    assert result is key
+    if not isinstance(result, NumericIndex):
+        assert result is key
+    else:
+        expected = NumericIndex(key)
+        tm.assert_index_equal(result, expected)
Member

rather than adding logic to the test (which can hide bugs), is it possible to either:

  • include the expected result in the parametrisation
  • OR split the test out into two separate ones, one of which uses assert result is key and the other tm.assert_index_equal(result, expected)?
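
As a rough illustration of the second option, the sketch below splits one branching test into two tests with a single assertion style each. `maybe_convert` is a hypothetical stand-in written only for this example, not the pandas internal method under review; it is assumed to return 64-bit input as the same object and to upcast narrower integers.

```python
import numpy as np


def maybe_convert(key: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the method under test: 64-bit input is
    # returned as the same object, narrower signed integers are upcast.
    if key.dtype.kind == "i" and key.dtype.itemsize < 8:
        return key.astype(np.int64)
    return key


def test_64bit_key_is_returned_unchanged():
    key = np.array([1, 2, 3], dtype=np.int64)
    result = maybe_convert(key)
    # Identity check only: no conversion should have happened.
    assert result is key


def test_32bit_key_is_upcast():
    key = np.array([1, 2, 3], dtype=np.int32)
    result = maybe_convert(key)
    expected = np.array([1, 2, 3], dtype=np.int64)
    # Equality check only: values are preserved, dtype is widened.
    assert result.dtype == expected.dtype
    np.testing.assert_array_equal(result, expected)
```

Each test now states one expectation, so a failure points directly at the case that broke.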

Member

Also, doesn't this test also pass on upstream/main? Is there a way to write it such that it fails there, but passes here?

Contributor Author

I didn't get time to address this question right now, sorry. I'll get back to this tonight.

Contributor Author

I've updated the PR.

Member

thanks for updating - any chance you could address the logic in the test comment too please?

IIRC when I tried executing this, it was just one of the make_key inputs which required a different assertion

If so, then can the assertion either be included in the parametrisation, or the test be split into two?

For reference, this advice comes from: https://testing.googleblog.com/2014/07/testing-on-toilet-dont-put-logic-in.html

Contributor Author

@topper-123 topper-123 Dec 30, 2022

Yeah good points in the article, very nice to have it articulated.

I've changed the PR. I started by separating the test into two tests split by type of make_key. However, I didn't like having two very similar tests, and I didn't like having lambdas in the parametrization (lambdas are difficult to introspect, and having several lambdas makes it difficult to see which test you're looking at when debugging), so I've made a new version.

I prefer the newest version (avoiding lambdas, clear inputs into the test function), but will await your comment.
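
The introspection point can be seen without any test framework: every lambda reports the same `__name__`, so several of them are indistinguishable in output that relies on function names. The two factories below are hypothetical and exist only for this illustration.

```python
def make_int_key(values):
    # Hypothetical named factory, used only for illustration.
    return [int(v) for v in values]


# Hypothetical lambda doing a similar job.
make_float_key = lambda values: [float(v) for v in values]  # noqa: E731

print(make_int_key.__name__)    # make_int_key
print(make_float_key.__name__)  # <lambda>
```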

Member

sure, looks better, thanks for updating

@@ -284,7 +306,10 @@ def _simple_new(
from pandas.core.indexes.base import ensure_index

left = ensure_index(left, copy=copy)
left = maybe_convert_numeric_to_64bit(left)
Member

just for my understanding, what's an example of where this makes a difference? the test you've modified passes even without this change

Contributor Author

@topper-123 topper-123 Dec 19, 2022

I think I took the wrong approach here originally.

The issue that this was supposed to solve is that on 32-bit systems e.g. IntervalArray.from_breaks([1, 2, 3]) should give an array with dtype interval[int64, right], to align with the convention in pandas that lists in constructors should be interpreted as 64-bit (e.g. Series([1, 2, 3]) and Index([1, 2, 3]) both give int64 dtype even on 32-bit systems). Previously (after #49560), giving lists to IntervalArray gave interval[int32, right]. This affected some tests in #49560, which is the reason I have taken this up.

In the newest version I moved this logic to a _maybe_convert_platform_interval, which IMO should be the better location for this.
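
A quick way to check the convention described above is to compare the dtypes that the three constructors pick for a plain list. This is only a sketch; on a pandas version that includes this change, all three should report int64 regardless of platform.

```python
import pandas as pd

# Plain Python lists carry no dtype, so pandas has to choose one.
arr = pd.arrays.IntervalArray.from_breaks([1, 2, 3])
ser = pd.Series([1, 2, 3])
idx = pd.Index([1, 2, 3])

# The interval dtype wraps a subtype describing the left/right endpoints.
print(arr.dtype.subtype)
print(ser.dtype)
print(idx.dtype)
```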

Contributor Author

@topper-123 topper-123 Dec 19, 2022

Also, this issue on 32-bit systems is only with integer dtypes as e.g. np.asarray([1.5]) will always have float64 dtype. So I've made this simpler in the newest version by just checking for integer dtype and converting to int64 if needed.
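
The integer-only check could look something like the sketch below. `upcast_int_to_int64` is a hypothetical name chosen for this example and is not the function in the PR; the int32 array stands in for what a 32-bit platform might produce from a list.

```python
import numpy as np


def upcast_int_to_int64(arr: np.ndarray) -> np.ndarray:
    # Hypothetical helper: only signed integer dtypes other than int64
    # are converted; everything else is returned untouched.
    if arr.dtype.kind == "i" and arr.dtype != np.int64:
        return arr.astype(np.int64)
    return arr


# Floats need no handling: NumPy infers float64 for Python floats.
floats = np.asarray([1.5])
print(floats.dtype)  # float64

# Simulate the 32-bit case with an explicit int32 array.
ints = np.asarray([1, 2, 3], dtype=np.int32)
print(upcast_int_to_int64(ints).dtype)  # int64
```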

Comment on lines 186 to 187
    if not is_array_like(arr):
        return arr
Member

just for my understanding, what's an example of where this makes a difference?

Contributor Author

@topper-123 topper-123 Dec 19, 2022

This was just a short-circuit, so the function breaks early if the value can't possibly be array-like; it made no functional difference. Overall this is probably not an improvement, as arr in the current version can't be a non-array and the function is typed, so I can remove it again.

I moved the function to core.dtypes.cast and renamed it maybe_upcast_numeric_to_64bit.

Member

Yeah would remove this if not necessary. This would get me guessing how to get here if I would want to make a change in a couple of weeks

@topper-123
Contributor Author

Sorry for the late response, I had some things I had to attend to over the weekend.

I've responded to your comments individually above. I did look into this again and agree that some of the suggestions in my original PR could be improved upon (especially the changes to IntervalArray) and I've uploaded a new version (with rebase).

@topper-123
Contributor Author

The failed check is unrelated.

@topper-123
Contributor Author

Rebased to make the CI run again. No other changes have been made.

@topper-123
Contributor Author

Ping.

As far as I see all comments have been addressed?

Member

@MarcoGorelli MarcoGorelli left a comment

Thanks for sticking with this

My only comment is about

    if not is_array_like(arr):
        return arr

, if this isn't covered by any tests, then TBH I'd prefer to keep it out

Other than that, I don't have any objections, but I'm not familiar enough with this part of the codebase to merge, so I'll hand over to @phofl

@topper-123
Contributor Author

👍 I've removed that code section.

Member

@MarcoGorelli MarcoGorelli left a comment

Thanks for updating! No objections - approving to remove my 'requested changes', but handing over to others with more expertise in this before merging

def maybe_convert_numeric_to_64bit(arr: NumpyIndexT) -> NumpyIndexT:
    # IntervalTree only supports 64 bit numpy array
    dtype = arr.dtype
    if not np.issubclass_(dtype.type, np.number):
Member

is this different from is_numeric_dtype?
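
One difference that can be checked directly, offered as a sketch and not as the answer given in this thread: pandas' `is_numeric_dtype` counts booleans as numeric, whereas a subclass check against `np.number` does not. The built-in `issubclass` is used below in place of `np.issubclass_`.

```python
import numpy as np
from pandas.api.types import is_numeric_dtype

for dtype in (np.int32, np.float64, np.bool_):
    scalar_type = np.dtype(dtype).type
    # Column 2: subclass check against np.number.
    # Column 3: pandas' broader notion of "numeric".
    print(np.dtype(dtype), issubclass(scalar_type, np.number), is_numeric_dtype(dtype))
```

The two checks agree on the integer and float dtypes and differ only on the boolean one.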

@topper-123
Contributor Author

Ping. I've rebased because this has been standing still for a bit.

@phofl phofl merged commit 939d0ba into pandas-dev:main Jan 10, 2023
@phofl
Member

phofl commented Jan 10, 2023

thx @topper-123

@topper-123 topper-123 deleted the IntervalIndex2 branch January 10, 2023 15:48
Labels
Interval Interval data type
5 participants