Code Sample
import numpy as np
import pandas as pd

# construct IntervalIndex from timestamps
# --> Length of index is > 100, the default leaf_size
# --> index is heavily skewed towards Timestamp.max
idx = pd.IntervalIndex.from_arrays(
    pd.date_range('2018-01-01', '2018-12-31', freq='D'),
    [pd.Timestamp.max] * 365,
)

# type conversion from IntervalIndex._engine
left = idx._maybe_convert_i8(idx.left)
right = idx._maybe_convert_i8(idx.right)

# pivot point calculation from IntervalNode constructor, when n_elements > leaf_size
pivot = np.median(left + right) / 2
print(pd.Timestamp(pivot))
# >>> 1848-02-11 00:06:21.572612096
# This is below the minimum left-side value, which should not be

print(pd.Timestamp(np.median(idx.mid.asi8)))
# >>> 2140-05-21 23:53:38.427387904
# This is the correct midpoint
# NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition

# proposed alternate logic for midpoint
alternate = np.median(left / 2 + right / 2)
print(pd.Timestamp(alternate))
# >>> 2140-05-21 23:53:38.427387904
# Matches the correct midpoint, and should never over/under flow
Description
Current behavior causes a hard crash in the Python process when both of the following conditions hold:
Length of the index is > 100, the default leaf_size of the IntervalTree
The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)
Having an index length greater than the leaf_size parameter causes the IntervalNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.
By emulating the IntervalNode.classify_intervals method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.
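The classification described above can be sketched roughly as follows. This is a simplified model, not the pandas source: `classify` is a hypothetical stand-in for IntervalNode.classify_intervals, and the integer bounds are made-up values standing in for nanosecond timestamps.

```python
import numpy as np

def classify(left, right, pivot):
    """Hypothetical, simplified stand-in for IntervalNode.classify_intervals."""
    goes_left = right < pivot     # interval lies entirely below the pivot
    goes_right = left > pivot     # interval lies entirely above the pivot
    overlaps = ~(goes_left | goes_right)
    return int(goes_left.sum()), int(goes_right.sum()), int(overlaps.sum())

# Assumed data: 200 intervals whose right bound is the int64 maximum
i64_max = np.iinfo(np.int64).max
left = np.arange(1, 201, dtype=np.int64) * 10**15
right = np.full(200, i64_max, dtype=np.int64)

bad_pivot = np.median(left + right) / 2       # computed on wrapped int64 sums
good_pivot = np.median(left / 2 + right / 2)  # computed on float64 halves

print(classify(left, right, bad_pivot))   # (0, 200, 0)
print(classify(left, right, good_pivot))  # (0, 0, 200)
```

With the wrapped pivot, the right-hand child receives exactly the same intervals as its parent, so it would compute the same pivot and split the same way again, which is consistent with the constructor never terminating.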
As an additional note, I also encountered an OverflowError when accessing the mid attribute of the IntervalIndex with my own data, but this does not occur in the code sample above. I traced it back through the length attribute, where left bounds are subtracted from right bounds.
idx.right - idx.left
# >>> Traceback (most recent call last):
# >>>   File "<input>", line 1, in <module>
# >>>   File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__
# >>>     result = self._data.__sub__(maybe_unwrap_index(other))
# >>>   File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__
# >>>     result = self._sub_datetime_arraylike(other)
# >>>   File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike
# >>>     arr_mask=self._isnan)
# >>>   File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr
# >>>     raise OverflowError("Overflow in int64 addition")
# >>> OverflowError: Overflow in int64 addition
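The data that triggered this error is not shown here, so the following is only a guess at the kind of bounds involved: a minimal sketch that checks whether `right - left` still fits in int64. The helper `span_overflows` and the example values are hypothetical.

```python
import numpy as np

i64 = np.iinfo(np.int64)

def span_overflows(left, right):
    """Return True if right - left cannot be represented as an int64."""
    # Python ints have arbitrary precision, so the exact difference is safe to compute
    diff = int(right) - int(left)
    return not (i64.min <= diff <= i64.max)

# Assumed nanosecond values: a left bound far below the epoch, a right bound far above it
print(span_overflows(-(2**62), 2**62 + 1))  # True
print(span_overflows(0, i64.max))           # False
```

This would explain why the code sample above does not hit the error: its left bounds are all positive, so the difference from Timestamp.max stays within range.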
Expected Output
Changing the pivot calculation from self.pivot = np.median(left + right) / 2
to self.pivot = np.median(left/2 + right/2)
should protect against both overflow and underflow.
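A small NumPy-only illustration of why halving before adding helps, using made-up int64 values rather than real timestamps:

```python
import numpy as np

# Assumed data: modest left bounds, right bounds at the int64 maximum
i64_max = np.iinfo(np.int64).max
left = np.array([10**15, 2 * 10**15, 3 * 10**15], dtype=np.int64)
right = np.full(3, i64_max, dtype=np.int64)

wrapped = left + right                 # int64 array addition wraps around silently
print((wrapped < 0).all())             # True

halved = left / 2 + right / 2          # true division yields float64, sum stays in range
print(halved.dtype)                    # float64
print(np.median(halved) > left.max())  # True
```

One trade-off worth noting: the halved values are float64, which near 2**62 can only resolve steps of about 1024, so the pivot may be off by roughly a microsecond when the units are nanoseconds. That should not matter for a pivot, which only needs to fall between the bounds.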
@kingsykes : Thanks for reporting this! Unless there is pushback, please feel free to open a PR with your patches and add a test to confirm working behavior.
I've created a PR (#25498) with your proposed change to the pivot calculation. I've also created a separate issue (#25499) for your additional note on the mid attribute in order to have that fully documented in a separate location.