BUG: fix dtype of all-NaN MultiIndex level #17934

toobaz · 2017-10-21T09:55:28Z

closes All-Nan MultiIndex level has different dtype than all-NaN flat Index #17929
tests added / passed
passes git diff master -u -- "*.py" | flake8 --diff
whatsnew entry

(This will need to be rebased on #17930, but in the meantime it is useful for discussion)

An alternative, more radical fix would be to give pd.CategoricalIndex([np.nan]).dtype.categories a float dtype (it is currently object).
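For readers who want to see the inconsistency from #17929 directly, here is a minimal sketch. It assumes a pandas release that already includes this fix, in which case all three dtypes are expected to be float64; before the fix the MultiIndex level was reported as object. The sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Flat Index built from an all-NaN list: inferred as float64.
flat = pd.Index([np.nan, np.nan])

# MultiIndex whose first level consists only of NaN (second level is arbitrary).
mi = pd.MultiIndex.from_arrays([[np.nan, np.nan], ["a", "b"]])

print("flat Index dtype:        ", flat.dtype)
print("MultiIndex level dtype:  ", mi.levels[0].dtype)
print("get_level_values dtype:  ", mi.get_level_values(0).dtype)
```

The level itself is empty, because NaN is never stored as a category; only its dtype is affected by the change.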

codecov · 2017-10-21T10:26:47Z

Codecov Report

Merging #17934 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17934      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50113    50116       +3     
==========================================
- Hits        45723    45717       -6     
- Misses       4390     4399       +9

Flag	Coverage Δ
#multiple	`89.03% <100%> (ø)`	⬆️
#single	`40.32% <66.66%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/categorical.py	`95.74% <100%> (+0.01%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77b4bb3...62ced7b. Read the comment docs.

codecov · 2017-10-21T10:26:53Z

Codecov Report

Merging #17934 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17934      +/-   ##
==========================================
+ Coverage   91.24%   91.24%   +<.01%     
==========================================
  Files         163      163              
  Lines       50091    50099       +8     
==========================================
+ Hits        45704    45715      +11     
+ Misses       4387     4384       -3

Flag	Coverage Δ
#multiple	`89.05% <100%> (+0.02%)`	⬆️
#single	`40.24% <55.55%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/categorical.py	`95.79% <100%> (+0.04%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️
pandas/plotting/_converter.py	`65.2% <0%> (+1.81%)`	⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5959ee3...a2680b9. Read the comment docs.

jreback · 2017-10-21T14:17:35Z

pandas/core/categorical.py

@@ -2291,6 +2292,8 @@ def _factorize_from_iterable(values):
        cat = Categorical(values, ordered=True)
        categories = cat.categories
        codes = cat.codes
+        if len(codes) and not len(categories):
+            categories = Float64Index([])


instead I would just defer to Index(values), e.g.

    In [16]: pd.Index([np.nan, np.nan])
    Out[16]: Float64Index([nan, nan], dtype='float64')

    In [17]: pd.Index([None, None])
    Out[17]: Index([None, None], dtype='object')

    In [18]: pd.Index([pd.NaT, pd.NaT])
    Out[18]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

I would argue, though, that this should instead happen inside the Categorical constructor rather than being special-cased here

toobaz · 2017-10-23T08:11:45Z

> instead I would just defer to Index(values)

Unfortunately, this would work only for all-NaN lists, since [3, np.nan] results in a float64 flat Index but in an int64 MultiIndex level. Anyway...

> I would argue though that this should instead happen inside the Category

... I agree, done.

While I was at it I removed some obsolete warnings.
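The [3, np.nan] case mentioned above can be checked with a short sketch. It assumes a reasonably recent pandas; the exact integer width may depend on the platform, so the example prints the dtypes rather than asserting a specific one.

```python
import numpy as np
import pandas as pd

values = [3, np.nan]

# A flat Index has to store the NaN itself, so the integers are upcast to float.
flat = pd.Index(values)

# A MultiIndex level only stores the distinct non-missing values;
# the NaN is represented by the code -1 instead.
level = pd.MultiIndex.from_arrays([values, ["a", "b"]]).levels[0]

print("flat Index dtype:      ", flat.dtype)
print("MultiIndex level dtype:", level.dtype, "values:", list(level))
```

This is why simply deferring to Index(values) would give the right dtype only when every value is missing.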

jreback · 2017-10-23T10:31:23Z

pandas/core/categorical.py

@@ -356,19 +358,10 @@ def __init__(self, values, categories=None, ordered=None, dtype=None,

            codes = _get_codes_for_values(values, dtype.categories)

-            # TODO: check for old style usage. These warnings should be removes


remove these independently (IOW another PR), it can be before or after this one.

jreback · 2017-10-23T10:32:00Z

pandas/core/categorical.py

                    sanitize_dtype = 'object'
                else:
                    sanitize_dtype = None
+                null_mask = isna(values)
+                values = [values[idx] for idx in range(len(values))


huh? use a vectorized method

> huh? use a vectorized method

Note that vectorizing the indexes is the best I could find without transforming the list of values into an array (which must not happen before inferring the dtype).
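A minimal sketch of the idea described here: compute the positions of the non-null entries in a vectorized way, but pick the elements out of the original list so that they are still plain Python objects when the dtype is inferred. This is an illustration with a made-up input list, not the exact code in the PR.

```python
import numpy as np
import pandas as pd

values = [1, None, 2, np.nan, 3]  # hypothetical input list

# pd.isna accepts a list and returns a boolean ndarray.
null_mask = pd.isna(values)

# Vectorized step: positions of the entries to keep.
keep = np.where(~null_mask)[0]

# The elements are taken from the original list, so they have not been
# coerced to float (or object) by an intermediate array.
non_null = [values[idx] for idx in keep]

print(non_null)
print(pd.Index(non_null).dtype)
```

Converting values to an array first would turn the integers into floats (or objects), which is the dtype guesswork the comment wants to avoid.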

jreback · 2017-10-23T12:55:07Z

pandas/core/categorical.py

                    sanitize_dtype = 'object'
                else:
                    sanitize_dtype = None
+                null_mask = isna(values)


This whole PR is adding a lot of complexity; pls see if you can simplify

Jeff, if I could simplify I would have done it already.

But notice this PR is actually simplifying the code path, potentially avoiding casting data from list to object ndarray to int64 ndarray (skipping the middle step). And the way missing values are treated seems to me much cleaner and clearer than before: all the type inference is just done after removing them, so there is less dtype guesswork.

ok will have another look
can u run the category asv and report any changes

BENCHMARKS NOT SIGNIFICANTLY CHANGED - should I copypaste the entire output?

no, just want to make sure

might want to add a benchmark for categorical creation (from a long list) with 2 cases: all nulls, and some non-null.

Done: as expected, this only makes a difference if there are many NaNs. My guess is that it's taking half the time because it's checking the "nan-ity" of each value once rather than twice.

          before           after         ratio
        [e1dabf37]       [b539298c]
    -     66.5±1ms       30.5±0.4ms     0.46  categoricals.Categoricals.time_constructor_all_nan

    SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

yep, that's fine, just wanted to be sure
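For context, an asv benchmark covering the two cases discussed above might look roughly like the sketch below. Only the name time_constructor_all_nan comes from the output quoted in this thread; the list size, the mixed-case data, and the name time_constructor_some_nan are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


class Categoricals:
    """asv-style benchmark sketch; asv calls setup() before each time_* method."""

    def setup(self):
        n = 10**5  # assumed size of the "long list"
        self.values_all_nan = [np.nan] * n
        # Hypothetical mixed case: every fifth element is missing.
        self.values_some_nan = [np.nan if i % 5 == 0 else i for i in range(n)]

    def time_constructor_all_nan(self):
        pd.Categorical(self.values_all_nan)

    def time_constructor_some_nan(self):  # hypothetical benchmark name
        pd.Categorical(self.values_some_nan)
```

Outside of asv, the same methods can be timed by hand after calling setup(), for example with timeit.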

jreback · 2017-10-25T12:26:19Z

pandas/core/categorical.py

                    sanitize_dtype = 'object'
                else:
                    sanitize_dtype = None
+                null_mask = isna(values)


I think you can move the

null_mask = isna(values)

to line 291 (where you set it to null_mask = np.array(False)),

as we always need to check this (whether it's an array or list-like) anyhow

> as we always need to check this (whether its an array or list-like anyhow)

Well, no. Conceptually, if the user passes an array (or a pandas object) with missing values, it already has a dtype, which also applies to the missing values and which we should respect, so any inference is done on the full array. In practice, I set it to np.array(False) precisely to avoid the cost of looking for missing values.

I can:

- change the name of the variable: null_mask is not "the mask of null values" but "the mask of values we want to leave out for the inference steps", and the two are the same only when values is a list
- document the above in a comment

That is: unless you have in mind future refactorings for which the location of missing values in an array matters.

can you make the change as discussed

IOW compute null_mask at the top

> can you make the change as discussed

As I explained above, computing null_mask at the top is a waste: we don't need it for arrays/NDFrames/indexes, as factorize looks for missing values anyway.
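The distinction drawn in this exchange (lists need inference after dropping missing values, while arrays already carry a dtype that is respected) can be illustrated with a small sketch. It assumes a pandas release that includes this PR; the inputs are made up for the example.

```python
import numpy as np
import pandas as pd

# List input: missing values are set aside first, then the dtype is inferred
# from what remains, so the categories stay integer.
from_list = pd.Categorical([1, np.nan])

# ndarray input: the array is already float64 (NaN forced the upcast when the
# array was created), and that dtype is kept for the categories.
from_array = pd.Categorical(np.array([1, np.nan]))

print("from list: ", from_list.categories.dtype, from_list.codes.tolist())
print("from array:", from_array.categories.dtype, from_array.codes.tolist())
```

In both cases the missing value ends up as code -1; only the dtype of the categories differs.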

jreback · 2017-10-25T12:27:17Z

pandas/core/categorical.py

@@ -370,6 +372,11 @@ def __init__(self, values, categories=None, ordered=None, dtype=None,
                     "mean to use\n'Categorical.from_codes(codes, "
                     "categories)'?", RuntimeWarning, stacklevel=2)

+        if null_mask.any():


add a comment here

jreback

comments and pls rebase

jreback · 2017-10-28T00:22:50Z

doc/source/whatsnew/v0.21.0.txt

@@ -964,6 +964,7 @@ Indexing
 - When called on an unsorted ``MultiIndex``, the ``loc`` indexer now will raise ``UnsortedIndexError`` only if proper slicing is used on non-sorted levels (:issue:`16734`).
 - Fixes regression in 0.20.3 when indexing with a string on a ``TimedeltaIndex`` (:issue:`16896`).
 - Fixed :func:`TimedeltaIndex.get_loc` handling of ``np.timedelta64`` inputs (:issue:`16909`).


move to 0.22

toobaz · 2017-10-28T08:16:19Z

@jreback ping

jreback · 2017-10-28T19:15:20Z

doc/source/whatsnew/v0.22.0.txt

@@ -96,7 +96,7 @@ Conversion
 Indexing
 ^^^^^^^^

-
+- Bug in ``MultiIndex`` which would assign object dtype to all-NaN levels (:issue:`17929`).


can you move this to the Other API Changes section? It has a bit more visibility there. ping when pushed, as this lgtm.

toobaz · 2017-10-29T09:14:00Z

@jreback : ping

jreback · 2017-10-29T19:11:12Z

thanks @toobaz

toobaz mentioned this pull request Oct 21, 2017

All-Nan MultiIndex level has different dtype than all-NaN flat Index #17929

Closed

jreback added the Compat (pandas objects compatibility with NumPy or Python functions) and MultiIndex labels Oct 21, 2017

jreback requested changes Oct 21, 2017

toobaz force-pushed the empty_level_dtype branch from 62ced7b to 1a2d03d Compare October 23, 2017 08:08

toobaz force-pushed the empty_level_dtype branch from 1a2d03d to 8ce8efb Compare October 23, 2017 09:27

jreback requested changes Oct 23, 2017

toobaz force-pushed the empty_level_dtype branch 3 times, most recently from fabc410 to e3a6cf2 Compare October 23, 2017 12:51

jreback reviewed Oct 23, 2017

toobaz force-pushed the empty_level_dtype branch 2 times, most recently from 3b3a002 to b539298 Compare October 24, 2017 13:54

jreback requested changes Oct 25, 2017

jreback requested changes Oct 28, 2017

toobaz force-pushed the empty_level_dtype branch from b539298 to cf85f9f Compare October 28, 2017 06:26

jreback added this to the 0.22.0 milestone Oct 28, 2017

jreback approved these changes Oct 28, 2017

jreback reviewed Oct 28, 2017

toobaz force-pushed the empty_level_dtype branch from cf85f9f to fb24cc6 Compare October 28, 2017 23:56

BUG: fix dtype of all-NaN categories and MultiIndex levels

a2680b9

toobaz force-pushed the empty_level_dtype branch from fb24cc6 to a2680b9 Compare October 29, 2017 07:57

jreback merged commit cd64aea into pandas-dev:master Oct 29, 2017

toobaz deleted the empty_level_dtype branch October 29, 2017 21:26

toobaz mentioned this pull request Oct 29, 2017

Remove old warnings (plus some useless code) #18022

Merged

2 tasks

peterpanmj pushed a commit to peterpanmj/pandas that referenced this pull request Oct 31, 2017

BUG: fix dtype of all-NaN categories and MultiIndex levels (pandas-de…

d85f7de

…v#17934)

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

BUG: fix dtype of all-NaN categories and MultiIndex levels (pandas-de…

b945703

…v#17934)
