Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: .nlargest with unsigned integers #21426

Closed
eoincondron opened this issue Jun 11, 2018 · 13 comments · Fixed by #21432
Closed

BUG: .nlargest with unsigned integers #21426

eoincondron opened this issue Jun 11, 2018 · 13 comments · Fixed by #21432
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Dtype Conversions Unexpected or buggy dtype conversions good first issue
Milestone

Comments

@eoincondron
Copy link

eoincondron commented Jun 11, 2018

Code Sample, a copy-pastable example if possible

pd.Series(np.array([0, 0, 0, 100, 1000, 10000, 100], dtype='uint32')).nlargest(5) 
0        0
1        0
2        0
5    10000
4     1000

Problem description

nlargest favours 0 above positive values. Common to both uint32 and uint64 types and possibly others.

Expected Output

5    10000
4     1000
3      100
6      100
0        0
dtype: uint32

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Jun 11, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

How strange! Let's figure out what's going on there...

@jschendel
Copy link
Member

jschendel commented Jun 11, 2018

I suspect the issue is with this block of code:

arr, _, _ = _ensure_data(dropped.values)
if method == 'nlargest':
arr = -arr

Specifically for uint data, I don't think -arr behaves as intended:

In [2]: arr = np.array([0, 0, 0, 100, 1000, 10000, 100], dtype='uint64')

In [3]: -arr
Out[3]:
array([                   0,                    0,                    0,
       18446744073709551516, 18446744073709550616, 18446744073709541616,
       18446744073709551516], dtype=uint64)

@gfyoung gfyoung added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 11, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

@jschendel : That would make sense. Hacky solution is to cast to int64, but then we lose half of the uint64 range, which isn't ideal. Should be possible to do without the hack.

@jschendel
Copy link
Member

I'm not all that well-versed regarding uint operations, but does simply doing -a - 1 suffice for the uint case? It looks -a - 1 properly reverses order, which appears to be the intention of the code:

In [2]: a = np.array([0, 1, 18446744073709551615], dtype='uint64')

In [3]: a
Out[3]: array([                   0,                    1, 18446744073709551615], dtype=uint64)

In [4]: -a
Out[4]: array([                   0, 18446744073709551615,                    1], dtype=uint64)

In [5]: -a - 1
Out[5]: array([18446744073709551615, 18446744073709551614,                    0], dtype=uint64)

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

Hmm...that actually might work, both for the uint and int cases in fact. Definitely would need to test the boundary cases for various (u)int dtypes to ensure correctness.

@gfyoung gfyoung added this to the 0.23.2 milestone Jun 11, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

Marking this for 0.23.2, as this is certainly patch-able in that time frame (I would be able to patch it if isn't resolved by then).

@alimcmaster1
Copy link
Member

Hi! I would like to pick this up if this is a good first issue? ( Feel free to assign to me )

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

@alimcmaster1 : Go for it!

@jschendel
Copy link
Member

I'm actually right about to put a fix in for this

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

@jschendel : Thanks for letting us know!

@alimcmaster1 : Sorry about that. 😞 But you're more than welcome to review the PR that @jschendel puts up soon-ish.

@alimcmaster1
Copy link
Member

No worries! Thanks can do :)

@jschendel
Copy link
Member

Created the PR. As another example, note that this fails for int64 on 0.23.0 as well:

In [2]: pd.__version__
Out[2]: '0.23.0'

In [3]: s = pd.Series([-9223372036854775808, 0, 9223372036854775807])

In [4]: s
Out[4]:
0   -9223372036854775808
1                      0
2    9223372036854775807
dtype: int64

In [5]: s.nlargest(2)
Out[5]:
0   -9223372036854775808
2    9223372036854775807
dtype: int64

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

Sigh...that's symptomatic of the same overflow issue presented with uint. Good catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Dtype Conversions Unexpected or buggy dtype conversions good first issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants