BUG: fix dtype of all-NaN MultiIndex level #17934
Codecov Report
@@ Coverage Diff @@
## master #17934 +/- ##
==========================================
- Coverage 91.23% 91.22% -0.02%
==========================================
Files 163 163
Lines 50113 50116 +3
==========================================
- Hits 45723 45717 -6
- Misses 4390 4399 +9
Codecov Report
@@ Coverage Diff @@
## master #17934 +/- ##
==========================================
+ Coverage 91.24% 91.24% +<.01%
==========================================
Files 163 163
Lines 50091 50099 +8
==========================================
+ Hits 45704 45715 +11
+ Misses 4387 4384 -3
pandas/core/categorical.py
Outdated
@@ -2291,6 +2292,8 @@ def _factorize_from_iterable(values):
cat = Categorical(values, ordered=True)
categories = cat.categories
codes = cat.codes
if len(codes) and not len(categories):
    categories = Float64Index([])
Instead I would just defer to Index(values), e.g.
In [16]: pd.Index([np.nan, np.nan])
Out[16]: Float64Index([nan, nan], dtype='float64')
In [17]: pd.Index([None, None])
Out[17]: Index([None, None], dtype='object')
In [18]: pd.Index([pd.NaT, pd.NaT])
Out[18]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
I would argue though that this should instead happen inside the Category constructor and not be special-cased here.
62ced7b to 1a2d03d
Unfortunately, this would work only for all-NaN lists, since
... I agree, done. While I was at it I removed some obsolete warnings.
1a2d03d to 8ce8efb
pandas/core/categorical.py
Outdated
@@ -356,19 +358,10 @@ def __init__(self, values, categories=None, ordered=None, dtype=None,

codes = _get_codes_for_values(values, dtype.categories)

# TODO: check for old style usage. These warnings should be removes
remove these independently (IOW another PR), it can be before or after this one.
OK
pandas/core/categorical.py
Outdated
    sanitize_dtype = 'object'
else:
    sanitize_dtype = None
null_mask = isna(values)
values = [values[idx] for idx in range(len(values))
huh? use a vectorized method
> huh? use a vectorized method
Notice vectorizing the indexes is the best I could find without transforming the list of values into an array (which must not happen before inferring the dtype).
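For readers following along, here is a minimal sketch of the constraint described above: dropping missing entries from a plain list while keeping it a list, so that no dtype is forced before inference. It is not the code from this PR (which vectorizes the indexes); it swaps in the standard library's `itertools.compress`, and `is_missing` and `drop_missing` are hypothetical helpers, with `is_missing` a toy stand-in for a real null check.

```python
import math
from itertools import compress


def is_missing(value):
    """Toy null check (assumption): None or a float NaN counts as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(values):
    """Return (kept_values, null_mask) without converting `values` to an array."""
    null_mask = [is_missing(v) for v in values]
    # compress() keeps the items whose selector is truthy, so invert the mask.
    kept = list(compress(values, (not m for m in null_mask)))
    return kept, null_mask


kept, mask = drop_missing([1, None, 3, float("nan")])
print(kept)  # [1, 3]
print(mask)  # [False, True, False, True]
```

The mask is returned alongside the kept values so that the positions of the missing entries can be restored after any inference step.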
fabc410 to e3a6cf2
    sanitize_dtype = 'object'
else:
    sanitize_dtype = None
null_mask = isna(values)
This is adding a lot of complexity (this whole PR); pls see if you can simplify.
Jeff, if I could simplify I would have done it already.
But notice this PR is actually simplifying the code path, potentially avoiding casting data from list to object ndarray to int64 ndarray (skipping the middle step). And the way missing values are treated seems to me much cleaner and clearer than before: all the type inference is just done after removing them, so there is less dtype guesswork.
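As a rough illustration of "all the type inference is just done after removing them", the following pure-Python sketch infers a dtype name only from the non-missing values. It is not the pandas implementation: `is_missing`, `infer_dtype_ignoring_missing`, and the three-way dtype rule are hypothetical simplifications, and the float64 fallback for all-missing input is an assumption chosen to mirror the goal of this PR.

```python
import math


def is_missing(value):
    """Toy null check (assumption): None or a float NaN counts as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_dtype_ignoring_missing(values):
    """Infer a dtype name from the non-missing values only (toy rules)."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        # Nothing left to infer from; assumed fallback for all-missing input.
        return "float64"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "int64"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "float64"
    return "object"


print(infer_dtype_ignoring_missing([1, 2, None]))                    # int64
print(infer_dtype_ignoring_missing([float("nan"), float("nan")]))    # float64
print(infer_dtype_ignoring_missing(["a", None]))                     # object
```

Because the missing values never reach the inference step, a list such as `[1, 2, None]` does not need to pass through an object container first.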
ok will have another look
can u run the category asv and report any changes
BENCHMARKS NOT SIGNIFICANTLY CHANGED - should I copy-paste the entire output?
no, just want to make sure
might want to add a benchmark for categorical creation (of a long list) with 2 cases: all nulls, and some non-null.
Done: as expected, this only makes a difference if there are many NaNs. My guess is that it's taking half the time because it's checking the "nan-ity" of each value once instead of twice.
      before        after      ratio
    [e1dabf37]    [b539298c]
-     66.5±1ms   30.5±0.4ms     0.46  categoricals.Categoricals.time_constructor_all_nan
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
yep, that's fine, just wanted to be sure
3b3a002 to b539298
    sanitize_dtype = 'object'
else:
    sanitize_dtype = None
null_mask = isna(values)
I think you can move the null_mask = isna(values) to line 291 (where you set it to null_mask = np.array(False)), as we always need to check this (whether it's an array or list-like) anyhow.
> as we always need to check this (whether its an array or list-like anyhow)
Well, no. Conceptually, if the user passes an array (or a pandas object) with missing values, it already has a dtype, which also applies to the missing values and which we should respect, so any kind of inference is done on the full array. In practice, I set it to np.array(False) precisely to avoid the cost of looking for missing values.
I can:
- change the name of the variable: null_mask is not "the mask of null values" but "the mask of values we want to leave out for the inference steps", and the two are the same only when values is a list
- document the above in a comment
That is: unless you have in mind future refactorings for which the location of missing values in an array matters.
can you make the change as discussed
IOW compute null_mask at the top
> can you make the change as discussed
As I explained above, computing null_mask at the top is a waste: we don't need it for arrays/ndframes/indexes, as factorize looks for missing values anyway.
@@ -370,6 +372,11 @@ def __init__(self, values, categories=None, ordered=None, dtype=None,
"mean to use\n'Categorical.from_codes(codes, "
"categories)'?", RuntimeWarning, stacklevel=2)

if null_mask.any():
add a comment here
comments and pls rebase
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -964,6 +964,7 @@ Indexing
- When called on an unsorted ``MultiIndex``, the ``loc`` indexer now will raise ``UnsortedIndexError`` only if proper slicing is used on non-sorted levels (:issue:`16734`).
- Fixes regression in 0.20.3 when indexing with a string on a ``TimedeltaIndex`` (:issue:`16896`).
- Fixed :func:`TimedeltaIndex.get_loc` handling of ``np.timedelta64`` inputs (:issue:`16909`).
move to 0.22
b539298 to cf85f9f
@jreback ping
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -96,7 +96,7 @@ Conversion
Indexing
^^^^^^^^

-
- Bug in ``MultiIndex`` which would assign object dtype to all-NaN levels (:issue:`17929`).
can you move this to Other API Changes section. It has a bit more visibility there. ping when pushed as this lgtm.
cf85f9f to fb24cc6
fb24cc6 to a2680b9
@jreback: ping
thanks @toobaz
git diff master -u -- "*.py" | flake8 --diff
(This will need to be rebased on #17930, but in the meantime it is useful for discussion.)
An alternative, more radical, fix is to have pd.CategoricalIndex([np.nan]).dtype.categories be float (currently object).