BUG: .nlargest with unsigned integers #21426

eoincondron · 2018-06-11T15:41:31Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

pd.Series(np.array([0, 0, 0, 100, 1000, 10000, 100], dtype='uint32')).nlargest(5)
0        0
1        0
2        0
5    10000
4     1000

Problem description

nlargest ranks 0 above positive values. This happens with both uint32 and uint64 dtypes, and possibly others.

Expected Output

5    10000
4     1000
3      100
6      100
0        0
dtype: uint32

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-06-11T17:37:05Z

How strange! Let's figure out what's going on there...

jschendel · 2018-06-11T19:38:24Z

I suspect the issue is with this block of code:

pandas/pandas/core/algorithms.py

Lines 1136 to 1138 in 4807905

    arr, _, _ = _ensure_data(dropped.values)
    if method == 'nlargest':
        arr = -arr

Specifically for uint data, I don't think -arr behaves as intended:

In [2]: arr = np.array([0, 0, 0, 100, 1000, 10000, 100], dtype='uint64')

In [3]: -arr
Out[3]:
array([                   0,                    0,                    0,
       18446744073709551516, 18446744073709550616, 18446744073709541616,
       18446744073709551516], dtype=uint64)
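
To make the wraparound concrete, here is a minimal sketch (NumPy only, not the pandas code path) showing that negating a uint64 array maps each nonzero x to 2**64 - x while 0 stays 0, so an ascending sort of the negated array puts the zeros first. The variable names are illustrative.

```python
import numpy as np

arr = np.array([0, 0, 0, 100, 1000, 10000, 100], dtype="uint64")

# Unsigned negation wraps modulo 2**64: x -> (2**64 - x) % 2**64
neg = -arr
expected = np.array([(2**64 - int(x)) % 2**64 for x in arr], dtype="uint64")

# Taking the 5 smallest negated values mimics the selection step;
# the zeros win because 0 is the only value that does not wrap.
first_five = np.argsort(neg, kind="stable")[:5].tolist()
print(first_five)
```

The positions picked this way match the index of the incorrect output reported at the top of the issue (0, 1, 2, 5, 4).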

gfyoung · 2018-06-11T19:47:59Z

@jschendel : That would make sense. Hacky solution is to cast to int64, but then we lose half of the uint64 range, which isn't ideal. Should be possible to do without the hack.
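
A small sketch of why the int64 cast loses half the range (NumPy only; the values are chosen for illustration): anything at or above 2**63 wraps to a negative number, so the relative order of large and small values is no longer preserved.

```python
import numpy as np

big = np.array([1, 2**63, 2**64 - 1], dtype="uint64")

# Values >= 2**63 cannot be represented in int64 and wrap to negatives
as_signed = big.astype("int64")

order_unsigned = np.argsort(big, kind="stable").tolist()
order_signed = np.argsort(as_signed, kind="stable").tolist()
print(as_signed.tolist(), order_unsigned, order_signed)
```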

jschendel · 2018-06-11T20:24:02Z

I'm not all that well-versed regarding uint operations, but does simply doing -a - 1 suffice for the uint case? It looks like -a - 1 properly reverses order, which appears to be the intention of the code:

In [2]: a = np.array([0, 1, 18446744073709551615], dtype='uint64')

In [3]: a
Out[3]: array([                   0,                    1, 18446744073709551615], dtype=uint64)

In [4]: -a
Out[4]: array([                   0, 18446744073709551615,                    1], dtype=uint64)

In [5]: -a - 1
Out[5]: array([18446744073709551615, 18446744073709551614,                    0], dtype=uint64)
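
One way to sanity-check the idea at the boundaries: in two's complement arithmetic, -a - 1 is the same as the bitwise NOT ~a, which is a strictly order-reversing map for both signed and unsigned integers. The sketch below checks that for a few dtypes; `reverse_order` and `check_dtype` are hypothetical helper names, not pandas functions.

```python
import numpy as np

def reverse_order(arr):
    """Hypothetical helper: order-reversing map for integer arrays."""
    return ~arr  # bitwise NOT, equal to -arr - 1 with wraparound

def check_dtype(dtype):
    """Return True if the map is strictly decreasing on boundary values."""
    info = np.iinfo(dtype)
    # Sorted, de-duplicated boundary values (min is 0 for unsigned dtypes)
    a = np.array(sorted({info.min, 0, 1, info.max}), dtype=dtype)
    r = reverse_order(a)
    same_as_formula = np.array_equal(r, -a - 1)
    strictly_decreasing = bool(np.all(r[:-1] > r[1:]))
    return same_as_formula and strictly_decreasing

results = {dt: check_dtype(dt) for dt in ("uint8", "uint64", "int8", "int64")}
print(results)
```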

gfyoung · 2018-06-11T20:51:25Z

Hmm...that actually might work, both for the uint and int cases in fact. Definitely would need to test the boundary cases for various (u)int dtypes to ensure correctness.

gfyoung · 2018-06-11T21:00:28Z

Marking this for 0.23.2, as this is certainly patch-able in that time frame (I would be able to patch it if it isn't resolved by then).

alimcmaster1 · 2018-06-11T23:01:52Z

Hi! I would like to pick this up if this is a good first issue? ( Feel free to assign to me )

gfyoung · 2018-06-11T23:04:19Z

@alimcmaster1 : Go for it!

jschendel · 2018-06-11T23:10:33Z

I'm actually right about to put a fix in for this

gfyoung · 2018-06-11T23:12:44Z

@jschendel : Thanks for letting us know!

@alimcmaster1 : Sorry about that. 😞 But you're more than welcome to review the PR that @jschendel puts up soon-ish.

alimcmaster1 · 2018-06-11T23:19:05Z

No worries! Thanks can do :)

jschendel · 2018-06-11T23:38:19Z

Created the PR. As another example, note that this fails for int64 on 0.23.0 as well:

In [2]: pd.__version__
Out[2]: '0.23.0'

In [3]: s = pd.Series([-9223372036854775808, 0, 9223372036854775807])

In [4]: s
Out[4]:
0   -9223372036854775808
1                      0
2    9223372036854775807
dtype: int64

In [5]: s.nlargest(2)
Out[5]:
0   -9223372036854775808
2    9223372036854775807
dtype: int64

gfyoung · 2018-06-11T23:43:58Z

Sigh...that's symptomatic of the same overflow issue presented with uint. Good catch!
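
For reference, a rough sketch of the int64 case and of an overflow-safe selection. This is only an illustration under the assumption that reversing order with bitwise NOT is acceptable; it is not the change made in the PR, and `nlargest_positions` is a hypothetical name.

```python
import numpy as np

def nlargest_positions(values, n):
    """Hypothetical sketch: positions of the n largest integers.

    Ties are broken by first occurrence.
    """
    arr = np.asarray(values)
    if arr.dtype.kind not in "iu":
        raise TypeError("this sketch only handles integer dtypes")
    # ~arr reverses the ordering without overflowing at the dtype boundaries
    return np.argsort(~arr, kind="stable")[:n]

s = np.array([-9223372036854775808, 0, 9223372036854775807], dtype="int64")

# Plain negation: the int64 minimum has no positive counterpart and maps to itself
wrapped = -s
buggy = np.argsort(wrapped, kind="stable")[:2].tolist()

fixed = nlargest_positions(s, 2).tolist()

u = np.array([0, 0, 0, 100, 1000, 10000, 100], dtype="uint32")
fixed_uint = nlargest_positions(u, 5).tolist()
print(buggy, fixed, fixed_uint)
```

With plain negation the selected positions are 0 and 2, matching the incorrect output above; with the order-reversing map they are 2 and 1, and the uint32 example from the original report gives the positions listed under Expected Output.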

gfyoung added the Bug and Dtype Conversions labels Jun 11, 2018

gfyoung added the Algos label Jun 11, 2018

gfyoung added this to the 0.23.2 milestone Jun 11, 2018

gfyoung added the good first issue label Jun 11, 2018

jschendel mentioned this issue Jun 11, 2018

BUG: Fix Series.nlargest for integer boundary values #21432

Merged

4 tasks

jreback changed the title from "nlargest bug with unsigned integers" to "BUG: .nlargest with unsigned integers" Jun 12, 2018

jreback closed this as completed in #21432 Jun 15, 2018

rileymcdowell mentioned this issue Sep 18, 2018

Unexpected output for nlargest function with multiple columns #22752

Closed

bpieper26 mentioned this issue Apr 20, 2019

Nlargest on boolean return False first #26154

Closed

2 tasks

karldw mentioned this issue Oct 15, 2019

nlargest gives a zero-row dataframe when ordering columns are all NaN #28984

Closed
