-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Preserve categorical dtypes in MultiIndex levels (#13743) #13854
Conversation
Current coverage is 85.27% (diff: 100%)@@ master #13854 diff @@
==========================================
Files 139 139
Lines 50555 50561 +6
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43111 43116 +5
- Misses 7444 7445 +1
Partials 0 0
|
@@ -855,3 +855,4 @@ Bug Fixes | |||
|
|||
- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`) | |||
- Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`) | |||
- Bug in ``MultiIndex.from_array`` and ``.from_product`` doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just say MultiIndex
constructor
Other places where a cidx = pd.CategoricalIndex(['y', 'x'], categories=list("xyz"), ordered=True)
cidx_nonunique = pd.CategoricalIndex(['y', 'x', 'y'], categories=list("xyz"), ordered=True)
|
@pijucha 5 & 6 you can ignore |
@pijucha this should exist as a separate function from the can certainly merge this fix and then do a followup with a more reaching name change. lmk. |
if is_categorical(values): | ||
if isinstance(values, (ABCCategoricalIndex, ABCSeries)): | ||
values = values._values | ||
categories = CategoricalIndex(values.categories, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a CI
and not just a Categorial
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency. The else
part returns cat.categories
, which is an Index.
@jreback OK, Thanks. |
Sorry for a bit of delay. I fixed |
@@ -1607,6 +1607,113 @@ def test_map(self): | |||
result = c.map(lambda x: 1) | |||
tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64)) | |||
|
|||
def test_groupby_preserve_dtype(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to test_groupby (for the groupby tests)
small changes. looks pretty good. |
I moved some tests to files I thought were more appropriate (though, I'm not 100% sure). |
@@ -864,9 +862,9 @@ def from_arrays(cls, arrays, sortorder=None, names=None): | |||
if len(arrays[i]) != len(arrays[i - 1]): | |||
raise ValueError('all arrays must be same length') | |||
|
|||
cats = [Categorical.from_array(arr, ordered=True) for arr in arrays] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we use Categorical.from_array
any longer? (in the codebase)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are just a few places: internals of concat
and unstack
, LegacyTable
in pytables, and panel_index
. I can replace them all with _factorize
- it's pretty straightforward. It wouldn't automatically fix concat
and unstack
(that's why I opened separate issues for them) but wouldn't hurt either.
If we replace them, what should be done to the definition of Categorical.from_array
? Remove completely or rather add a deprecation warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally we should replace these and deprecate the constructor. but can do that later / another PR if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll try to do it today if it goes smoothly. Otherwise, I'll do a follow up as soon as this PR is merged in.
Just a question:
Categorical.from_array
should emit a FutureWarning
with a comment like this: "Categorical.from_array is deprecated, use Categorical instead"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
Deprecated |
update |
lgtm. @sinhrks @jorisvandenbossche |
@pijucha can you rebase |
…eprecate .from_array Now, categorical dtype is preserved also in `groupby`, `set_index`, `stack`, `get_dummies`, and `make_axis_dummies`.
@jreback Done (rebase + small update to tests/test_reshape.py). One build on travis has probably stalled - should I resubmit? |
@pijucha i restarted the build. ping when green. |
thanks @pijucha really nice PR. touches lots of parts! |
Deprecated in 0.19.0 xref pandas-devgh-13854.
Deprecated in 0.19.0 xref gh-13854.
git diff upstream/master | flake8 --diff
This commit modifies
MultiIndex.from_array
andMultiIndex.from_product
.Example:
Previously, the results were:
This modification makes
groupby
,pivot
, andset_index
preserve categorical types in indexes.