Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gh 36562 typeerror comparison not supported between float and str #37096

Merged
merged 23 commits into from
Nov 4, 2020
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
1320ff1
Fixed #36562
ssche Oct 13, 2020
a1b9385
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 13, 2020
1f76d21
Fixed flake issue
ssche Oct 13, 2020
7076841
fixed linting errors
ssche Oct 13, 2020
225675d
Addressed review comments
ssche Oct 14, 2020
2677166
Make `sort_tuples` last resort sorting strategy to allow faster sorte…
ssche Oct 16, 2020
6f476bc
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 16, 2020
3688238
manually apply pandas black changes (from CI)
ssche Oct 16, 2020
aba429c
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 16, 2020
8ae9279
Modified changelog message as per review comments
ssche Oct 21, 2020
dd5a38d
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 21, 2020
7b7d6f8
Removed pd.xyz
ssche Oct 22, 2020
4226662
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 22, 2020
8cbfa01
forgot `pd`
ssche Oct 22, 2020
6d71000
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 26, 2020
92e1e33
Changed sorting algorithm by using expanded array
ssche Oct 31, 2020
3ada9ce
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 31, 2020
95670f1
Add ticket ref to test case
ssche Oct 31, 2020
d9dcd22
sort import
ssche Oct 31, 2020
66c30c7
Address review comments
ssche Oct 31, 2020
db66528
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 31, 2020
16ae4f4
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Nov 3, 2020
9d03dc3
Fixed: committed file was in merge state
ssche Nov 3, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -409,6 +409,7 @@ MultiIndex

- Bug in :meth:`DataFrame.xs` when used with :class:`IndexSlice` raises ``TypeError`` with message ``"Expected label or tuple of labels"`` (:issue:`35301`)
- Bug in :meth:`DataFrame.reset_index` with ``NaT`` values in index raises ``ValueError`` with message ``"cannot convert float NaN to integer"`` (:issue:`36541`)
- Bug in :meth:`DataFrame.combine_first` when used with :class:`MultiIndex` containing string and ``NaN`` values raises ``TypeError`` with message ``"'<' not supported between instances of 'float' and 'str'"`` (:issue:`36562`)
ssche marked this conversation as resolved.
Show resolved Hide resolved
-

I/O
Expand Down
48 changes: 44 additions & 4 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
"""
from __future__ import annotations

import functools
import operator
from textwrap import dedent
from typing import TYPE_CHECKING, Dict, Optional, Tuple, Union, cast
Expand Down Expand Up @@ -2055,13 +2056,52 @@ def sort_mixed(values):
strs = np.sort(values[str_pos])
return np.concatenate([nums, np.asarray(strs, dtype=object)])

def sort_tuples(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you simply try to remove the nan's from values and add them to the end is fine and just fix up sort_mixed, we do this in a number of places e.g. something like

In [204]: arr = np.array(['foo', 3, np.nan], dtype=object)                                                                                                                                                             

In [205]: mask = pd.notna(arr)                                                                                                                                                                                         

In [206]: arr.take(np.arange(len(arr))[mask])                                                                                                                                                                          
Out[206]: array(['foo', 3], dtype=object)

Copy link
Contributor Author

@ssche ssche Oct 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started off with trying to fix sort_mixed and removed nans as you suggested and would have loved to use it instead. however the structure of the array doesn't allow for that (we are talking about a 1-d array of tuples, i.e. dtype object, not an N-d array which would allow for all those vector operations to be applicable).

here's the data of values that gets passed in to sort_tuples() when running the test I added:

In[2]: 
values
Out[2]: 
array([('b', 1), ('b', 2), ('c', 3), ('a', 4), ('b', 5), (nan, 6),
       ('a', 1), ('c', 1), ('d', 1)], dtype=object)
values[0]
Out[3]: ('b', 1)
values.shape
Out[4]: (9,)

I could convert value to a N-d array (and use sort_mixed), but that solution would come with its own overhead costs...

In[5]: 
np.asarray(list(values))
Out[5]: 
array([['b', '1'],
       ['b', '2'],
       ['c', '3'],
       ['a', '4'],
       ['b', '5'],
       ['nan', '6'],
       ['a', '1'],
       ['c', '1'],
       ['d', '1']], dtype='<U3')

your call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any feedback on the above, @jreback?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i care about code complexity here, this is adding a lot, try this

# sorts tuples with mixed values. can handle nan vs string comparisons.
def cmp_func(index_x, index_y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do cmp_func(left, right)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes let's change the name and please add typing for cmp_func and sort_tuples. (sort_mixed if you can as well :->)

x = values[index_x]
y = values[index_y]
if x == y:
return 0
len_x = len(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't explicitly assign these, just use them directly whenever they're needed
it's an O(1) lookup so this doesn't save any time but makes code a bit more verbose

len_y = len(y)
for i in range(max(len_x, len_y)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally speaking I wonder if the control flow can be simplified a bit here. you have a number of ifs but only two possible outputs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 outcomes (=,<,>), but in combination with nan, they increase.

there are 9 ifs of which 2 are not relevant (shortcut before the loop, fall-though after the loop).

7 ifs remaining:

  1. equality case (continue)
  2. greater than case
  3. smaller than case
  4. x is nan, but y is not
  5. y is nan, but x is not
  6. xvec is smaller than yvec
  7. yvec is smaller than xvec

I could merge some of the cases (with same return value), but that would compromise readability and require nan checks to be done earlier.

# check if the tuples have different lengths (shorter tuples
# first)
if i >= len_x:
return -1
if i >= len_y:
return +1
x_i_na = isna(x[i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x_is_na and y_is_na

y_i_na = isna(y[i])
# values are the same -> resolve tie with next element
if (x_i_na and y_i_na) or (x[i] == y[i]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need (x_i_na and y_i_na)? Could one of them be null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equality check - both of the values are the same (both na or both the same non-na value).

continue
# check for nan values (sort nan to the end which is consistent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do this before checking for equality?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say no need to comment about consistency with numpy in the code (but thanks for noting it here!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do this before checking for equality?

sure, but why?

# with numpy
if x_i_na and not y_i_na:
return +1
if not x_i_na and y_i_na:
return -1
# normal greater/less than comparison
if x[i] < y[i]:
return -1
return +1
return 0

ixs = np.arange(len(values))
ixs = sorted(ixs, key=functools.cmp_to_key(cmp_func))
return values[ixs]

sorter = None
if (
not is_extension_array_dtype(values)
and lib.infer_dtype(values, skipna=False) == "mixed-integer"
):

ext_arr = is_extension_array_dtype(values)
ssche marked this conversation as resolved.
Show resolved Hide resolved
if not ext_arr and lib.infer_dtype(values, skipna=False) == "mixed-integer":
# unorderable in py3 if mixed str/int
ssche marked this conversation as resolved.
Show resolved Hide resolved
ordered = sort_mixed(values)
elif not ext_arr and values.size and isinstance(values[0], tuple):
# 1-D arrays with tuples of potentially mixed type (solves GH36562)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would skip this comment.
In general we don't leave references to issues in the code, git/blame keeps track of that
You have a comment upstairs in sort_tuples that explains what it does so I think no need to repeat here

ordered = sort_tuples(values)
else:
try:
sorter = values.argsort()
Expand Down
25 changes: 25 additions & 0 deletions pandas/tests/indexing/multiindex/test_multiindex.py
Original file line number Diff line number Diff line change
Expand Up @@ -91,3 +91,28 @@ def test_multiindex_get_loc_list_raises(self):
msg = "unhashable type"
with pytest.raises(TypeError, match=msg):
idx.get_loc([])


def test_combine_first_with_nan_index():
ssche marked this conversation as resolved.
Show resolved Hide resolved
mi1 = pd.MultiIndex.from_arrays(
[["b", "b", "c", "a", "b", np.nan], [1, 2, 3, 4, 5, 6]], names=["a", "b"]
)
df = pd.DataFrame({"c": [1, 1, 1, 1, 1, 1]}, index=mi1)
mi2 = pd.MultiIndex.from_arrays(
[["a", "b", "c", "a", "b", "d"], [1, 1, 1, 1, 1, 1]], names=["a", "b"]
)
s = pd.Series([1, 2, 3, 4, 5, 6], index=mi2)
df_combined = df.combine_first(pd.DataFrame({"col": s}))
mi_expected = pd.MultiIndex.from_arrays(
[
["a", "a", "a", "b", "b", "b", "b", "c", "c", "d", np.nan],
[1, 1, 4, 1, 1, 2, 5, 1, 3, 1, 6],
],
names=["a", "b"],
)
assert (df_combined.index == mi_expected).all()
ssche marked this conversation as resolved.
Show resolved Hide resolved
exp_col = np.asarray(
[1.0, 4.0, np.nan, 2.0, 5.0, np.nan, np.nan, 3.0, np.nan, 6.0, np.nan]
)
act_col = df_combined["col"].values
assert np.allclose(act_col, exp_col, rtol=0, atol=0, equal_nan=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For floating point equality use assert_almost_equal(res, expected)

7 changes: 7 additions & 0 deletions pandas/tests/test_sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -453,3 +453,10 @@ def test_extension_array_codes(self, verify, na_sentinel):
expected_codes = np.array([0, 2, na_sentinel, 1], dtype=np.intp)
tm.assert_extension_array_equal(result, expected_values)
tm.assert_numpy_array_equal(codes, expected_codes)


def test_mixed_str_nan():
values = np.array(["b", np.nan, "a", "b"], dtype=object)
result = safe_sort(values)
expected = np.array([np.nan, "a", "b", "b"], dtype=object)
tm.assert_numpy_array_equal(result, expected)