BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

kingsykes · 2019-02-28T21:21:37Z

Code Sample:

import numpy as np
import pandas as pd

# construct IntervalIndex from timestamps
# --> Length of index is > 100, the default leaf_size
# --> index is heavily skewed towards Timstamp.max
idx = pd.IntervalIndex.from_arrays(pd.date_range('2018-01-01', '2018-12-31', freq='D'), [pd.Timestamp.max]*365)

# type conversion from IntervalIndex._engine
left = idx._maybe_convert_i8(idx.left)
right = idx._maybe_convert_i8(idx.right)

# pivot point calculation from IntervalNode constructor, when n_elements > leaf_size
pivot = np.median(left + right) / 2

print(pd.Timestamp(pivot))
# >>> 1848-02-11 00:06:21.572612096
# This is below the minimum left-side value, which should not be

print(pd.Timestamp(np.median(idx.mid.asi8)))
# >>> 2140-05-21 23:53:38.427387904
# This is the correct midpoint
# NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition

# proposed alternate logic for midpoint
alternate = np.median(left/2 + right/2)

print(pd.Timestamp(alternate)
# >>> 2140-05-21 23:53:38.427387904
# Matches the correct midpoint, and should never over/under flow

Description

Current behavior causes a hard crash in the python process under certain conditions:

Length of the index is > 100, the default leaf_size of the IntervalTree
The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)

Having an index length greater than the leaf_size parameter, causes the the IndexNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.

By emulating the IntervalNode.classify_intervals method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.

As an additional note, I also encountered an OverflowError when accessing the mid attribute of the IntervalIndex with my own data, but this does not occur in the code sample above. I traced it back through the length attribute, where left bounds are subtracted from right bounds.

idx.right - idx.left
# >>> Traceback (most recent call last):
# >>>   File "<input>", line 1, in <module>
# >>>   File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__
# >>>     result = self._data.__sub__(maybe_unwrap_index(other))
# >>>   File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__
# >>>     result = self._sub_datetime_arraylike(other)
# >>>   File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike
# >>>     arr_mask=self._isnan)
# >>>   File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr
# >>>     raise OverflowError("Overflow in int64 addition")
# >>> OverflowError: Overflow in int64 addition

Expected Output

Changing the pivot calculation from
self.pivot = np.median(left + right) / 2
to
self.pivot = np.median(left/2 + right/2)
should protect against over and underflows.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
------------------
pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 40.2.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

gfyoung · 2019-03-01T01:43:48Z

cc @TomAugspurger

@kingsykes : Thanks for reporting this! Unless there is pushback, please feel free to open a PR with your patches and add a test to confirm working behavior.

jschendel · 2019-03-01T08:54:31Z

Thanks for the detailed report @kingsykes!

I've created a PR (#25498) with your proposed change to the pivot calculation. I've also created a separate issue (#25499) for your additional note on the mid attribute in order to have that fully documented in a separate location.

gfyoung added Numeric Operations Arithmetic, Comparison, and Logical operations Interval Interval data type labels Mar 1, 2019

This was referenced Mar 1, 2019

BUG: Fix RecursionError during IntervalTree construction #25498

Merged

BUG: Overflow in IntervalIndex/IntervalArray mid/length #25499

Open

jreback modified the milestones: 0.25.0, 0.24.2 Mar 3, 2019

jreback closed this as completed in #25498 Mar 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

kingsykes commented Feb 28, 2019 •

edited

Loading

gfyoung commented Mar 1, 2019 •

edited

Loading

jschendel commented Mar 1, 2019

BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

Comments

kingsykes commented Feb 28, 2019 • edited Loading

Code Sample:

Description

Expected Output

Output of pd.show_versions()

gfyoung commented Mar 1, 2019 • edited Loading

jschendel commented Mar 1, 2019

kingsykes commented Feb 28, 2019 •

edited

Loading

Output of `pd.show_versions()`

gfyoung commented Mar 1, 2019 •

edited

Loading