Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix dtype of all-NaN MultiIndex level #17934

Merged
merged 1 commit into from
Oct 29, 2017

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Oct 21, 2017

(This will need to be rebased on #17930 , but in the meanwhile it is useful for discussion)

An alternative, more radical, fix is to have pd.CategoricalIndex([np.nan]).dtype.categories float (currently object).

@codecov
Copy link

codecov bot commented Oct 21, 2017

Codecov Report

Merging #17934 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17934      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50113    50116       +3     
==========================================
- Hits        45723    45717       -6     
- Misses       4390     4399       +9
Flag Coverage Δ
#multiple 89.03% <100%> (ø) ⬆️
#single 40.32% <66.66%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.74% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77b4bb3...62ced7b. Read the comment docs.

@codecov
Copy link

codecov bot commented Oct 21, 2017

Codecov Report

Merging #17934 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17934      +/-   ##
==========================================
+ Coverage   91.24%   91.24%   +<.01%     
==========================================
  Files         163      163              
  Lines       50091    50099       +8     
==========================================
+ Hits        45704    45715      +11     
+ Misses       4387     4384       -3
Flag Coverage Δ
#multiple 89.05% <100%> (+0.02%) ⬆️
#single 40.24% <55.55%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.79% <100%> (+0.04%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5959ee3...a2680b9. Read the comment docs.

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions MultiIndex labels Oct 21, 2017
@@ -2291,6 +2292,8 @@ def _factorize_from_iterable(values):
cat = Categorical(values, ordered=True)
categories = cat.categories
codes = cat.codes
if len(codes) and not len(categories):
categories = Float64Index([])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead I would just defer to Index(values), e.g.

In [16]: pd.Index([np.nan, np.nan])
Out[16]: Float64Index([nan, nan], dtype='float64')

In [17]: pd.Index([None, None])
Out[17]: Index([None, None], dtype='object')

In [18]: pd.Index([pd.NaT, pd.NaT])
Out[18]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue though that this should instead happen inside the Category constructor and not special case here

@toobaz
Copy link
Member Author

toobaz commented Oct 23, 2017

instead I would just defer to Index(values)

Unfortunately, this would work only for all-NaN lists, since [3, np.nan] will result in a float64 flat Index but in an int64 level of MultiIndex. Anyway...

I would argue though that this should instead happen inside the Category

... I agree, done.

While I was at it I removed some obsolete warnings.

@@ -356,19 +358,10 @@ def __init__(self, values, categories=None, ordered=None, dtype=None,

codes = _get_codes_for_values(values, dtype.categories)

# TODO: check for old style usage. These warnings should be removes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove these independently (IOW another PR), it can be before or after this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

sanitize_dtype = 'object'
else:
sanitize_dtype = None
null_mask = isna(values)
values = [values[idx] for idx in range(len(values))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? use a vectorized method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? use a vectorized method

Notice vectorizing the indexes is the best I could find without transforming the list of values into an array (which must not happen before inferring the dtype).

@toobaz toobaz force-pushed the empty_level_dtype branch 3 times, most recently from fabc410 to e3a6cf2 Compare October 23, 2017 12:51
sanitize_dtype = 'object'
else:
sanitize_dtype = None
null_mask = isna(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adding a lot of complexity (this whole) PR, pls see if you can simplify

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeff, if I could simplify I would have done it already.

But notice this PR is actually simplifying the code path, potentially avoiding casting data from list to object ndarray to int64 ndarray (skipping the middle step). And the way missing values are treated seems to me much cleaner and clearer than before: all the type inference is just done after removing them, so there is less dtype guesswork.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok will have another look
can u run the category asv and report any changes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BENCHMARKS NOT SIGNIFICANTLY CHANGED - should I copypaste the entire output?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, just want to make sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to add a benchmark for categorical creation (or a long list) with 2 cases (all nulls) and some non-null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: as expected, this only makes a difference if there are many NaNs. My guess is that it's taking half the time because it's checking the "nan-ity" of each value once instead than twice.

      before           after         ratio
     [e1dabf37]       [b539298c]
-        66.5±1ms       30.5±0.4ms     0.46  categoricals.Categoricals.time_constructor_all_nan

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, that's fine, just wanted to be sure

@toobaz toobaz force-pushed the empty_level_dtype branch 2 times, most recently from 3b3a002 to b539298 Compare October 24, 2017 13:54
sanitize_dtype = 'object'
else:
sanitize_dtype = None
null_mask = isna(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can move the

null_mask = isna(values)

to line 291 (where you set it to: null_mask = np.array(False)``

as we always need to check this (whether its an array or list-like anyhow)

Copy link
Member Author

@toobaz toobaz Oct 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we always need to check this (whether its an array or list-like anyhow)

Well, no: conceptually, because if the user passes an array (or a pandas object) with missing values, it already has a dtype, which also applies to the missing values and we should respect, so any kind of inference is done on the full array; in practice, because I set it to np.array(False) precisely to avoid the cost of looking for missing values.

I can:

  • change the name of the variable: null_mask is not "the mask of null values" but it is "the mask of values we want to leave out for the inference steps", and the two are the same only when values is a list
  • document the above in a comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is: unless you have in mind future refactorings for which the location of missing values in an array matters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make the change as discussed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW compute null_mask at the top

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make the change as discussed

As I explained above, computing null_mask at the top is a waste, we don't need it for arrays/ndframes/indexes, as factorize looks for missing values anyway.

@@ -370,6 +372,11 @@ def __init__(self, values, categories=None, ordered=None, dtype=None,
"mean to use\n'Categorical.from_codes(codes, "
"categories)'?", RuntimeWarning, stacklevel=2)

if null_mask.any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments and pls rebase

@@ -964,6 +964,7 @@ Indexing
- When called on an unsorted ``MultiIndex``, the ``loc`` indexer now will raise ``UnsortedIndexError`` only if proper slicing is used on non-sorted levels (:issue:`16734`).
- Fixes regression in 0.20.3 when indexing with a string on a ``TimedeltaIndex`` (:issue:`16896`).
- Fixed :func:`TimedeltaIndex.get_loc` handling of ``np.timedelta64`` inputs (:issue:`16909`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.22

@toobaz
Copy link
Member Author

toobaz commented Oct 28, 2017

@jreback ping

@jreback jreback added this to the 0.22.0 milestone Oct 28, 2017
@@ -96,7 +96,7 @@ Conversion
Indexing
^^^^^^^^

-
- Bug in ``MultiIndex`` which would assign object dtype to all-NaN levels (:issue:`17929`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this to Other API Changes section. It has a bit more visibility there. ping when pushed as this lgtm.

@toobaz
Copy link
Member Author

toobaz commented Oct 29, 2017

@jreback : ping

@jreback jreback merged commit cd64aea into pandas-dev:master Oct 29, 2017
@jreback
Copy link
Contributor

jreback commented Oct 29, 2017

thanks @toobaz

@toobaz toobaz deleted the empty_level_dtype branch October 29, 2017 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

All-Nan MultiIndex level has different dtype than all-NaN flat Index
2 participants